
refactor: harden typetracer operations #1849

Closed
wants to merge 91 commits

Conversation

agoose77
Collaborator

@agoose77 agoose77 commented Oct 31, 2022

Awkward's nplike mechanism currently exposes implementation details with respect to the underlying array libraries used. This poses a problem when these implementations differ, or when we need to replicate them for typetracer, which has no backing array module.

To improve this, we should define the NumpyLike API more strongly, and move away from value-based behaviour that applies to only a subset of operations. This will make it easier to ensure that our nplikes are well behaved and consistent, at the cost of writing more code than the simple shims that we currently define.

Primary Goals

  • Typed signatures for NumpyLike
  • Value-less result types

Secondary Goals

The secondary goal of this PR is to make the UnknownScalar object complain if it is used in a concrete context. This will break existing usages, but I think it will make it harder to write code that silently behaves incorrectly when a typetracer array is passed through. That is, this change will require more if nplike.known_data checks in our content classes. I think this is a good change, though; the NumpyLike mechanism can't anticipate how users of the nplike will interpret the data, so we should surface this logic explicitly where it matters.

Implementation

The NumpyLike mechanism serves two purposes:

  1. Standardise the NumPy API across different array libraries
  2. Implement Awkward-specific array-handling utility functions in a cross-library manner

(1) will ultimately be handled by the Array API that NumPy et al. are implementing. This stricter subset of the NumPy API provides guarantees about output types, and is designed for cross-compatibility between array libraries. It is not ready, however, for use across all NumPy versions, and does not yet cover the full suite of types and operations (?) that we need. The latest draft adds support for complex datatypes, but datetime objects remain missing. There are several reasons why we cannot simply adopt the Array API outright:

  • Lack of support in old NumPy versions (NEP 47 requires NumPy ≥ 1.22, which is newer than our minimum supported version)
  • Lack of support for datetime64 / complexXXX?
  • Lack of support for ordering (Fortran, C)

Therefore, the NumpyLike API will deviate from the Array API standard in order to support this information. In general, it will be preferable to add special NumpyLike methods (e.g. as_contiguous()) rather than new parameters to the NumpyLike API. In addition to new parameters, there are also parameters in the Array API specification that may not be appropriate for nplike. It would be preferable to define unused arguments anyway, so that we can later replace NumpyLike methods with thin shims over the respective Array API object, i.e.

class Numpy(NumpyLike):
    reshape = numpy.array_api.reshape
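For methods that cannot yet be pure pass-throughs, a minimal sketch of "define unused arguments anyway" might look like this (the signature mirrors the Array API's reshape; the copy handling is an assumption, not this PR's implementation):

import numpy

class Numpy(NumpyLike):
    def reshape(self, x, shape, *, copy=None):
        # `copy` is accepted (to match the Array API signature) but only
        # partially honoured for now, so this method can later be replaced
        # by the thin shim shown above
        if copy is False:
            raise NotImplementedError("no-copy reshape is not guaranteed")
        return numpy.reshape(x, shape)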

We could implement an Array API namespace for typetracer, such that the TypeTracerArray can be operated upon by third-party code that supports the Array API. There is no strong motivation for this; NumpyLike should remain an internal detail, and no users should ever see e.g. TypeTracerArray in their interactions with Awkward. Because of this, there is little argument for code that currently consumes the NumpyLike to use the Array API directly in future, given that we still have to consider (2).

(2) will remain an important motivation for NumpyLikes. We have small additions like is_eager and is_own_array that abstract library-specific features. We could remove is_own_array, since testing array.__array_namespace__() is array_api_obj would serve the same purpose, but these other features remain important. Furthermore, we want to flavour a layout by its nplike, so we need to keep the "expected" array API information in the layout somewhere. We don't want to have to implement the entire Array API for e.g. TypeTracer, so it's best that we go through the NumpyLike mechanism in order to explicitly define the available functions.

Tasks

  • Make UnknownScalar a non-concrete object
  • Make TypeTracer.add etc. complain for non-predictable dtypes
  • Ensure ListArray._pad_none always adds option
  • Add finfo and iinfo to NumpyLike
  • Implement timelike promotion rules
  • Replace nplike.ndarray usage with nplike.is_own_array

📚 The documentation for this PR will be available at https://awkward-array.readthedocs.io/en/agoose77-refactor-typetracer/ once Read the Docs has finished building 🔨

@agoose77 agoose77 marked this pull request as draft October 31, 2022 11:57
@agoose77
Collaborator Author

Note that, under NumPy's rules, the result of array + scalar is not predictable; NumPy calls np.min_scalar_type() on the scalar, which is value-dependent. Being rigorous, this should fail if attempted, hence TypeTracer.min_scalar_type() will accept concrete scalars, but fail for UnknownScalar.
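For illustration, np.min_scalar_type() (and hence the result of array + scalar) depends on the scalar's value in NumPy 1.x:

>>> import numpy as np
>>> np.min_scalar_type(10)
dtype('uint8')
>>> np.min_scalar_type(300)
dtype('uint16')
>>> np.min_scalar_type(-1)
dtype('int8')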

@agoose77 agoose77 force-pushed the agoose77/refactor-typetracer branch from bd53abd to a1fa4b8 Compare October 31, 2022 13:47
@jpivarski
Member

I recognize why we'd want to be rigorous like this, but that could easily make common operations fail to be Dask-delayable.

Fortunately, we have a model: what does dask.array do?

>>> import numpy as np, dask.array as da
>>> np.array([1, 2, 3], np.uint8) + np.uint16(10)
array([11, 12, 13], dtype=uint8)
>>> np.array([1, 2, 3], np.uint8) + np.uint16(300)
array([301, 302, 303], dtype=uint16)

Okay, so NumPy promotes the dtype if the value is large enough (value dependent).

>>> da.array(np.array([1, 2, 3], np.uint8)) + da.array(np.uint16(10))
dask.array<add, shape=(3,), dtype=uint16, chunksize=(3,), chunktype=numpy.ndarray>
>>> da.array(np.array([1, 2, 3], np.uint8)) + da.array(np.uint16(300))
dask.array<add, shape=(3,), dtype=uint16, chunksize=(3,), chunktype=numpy.ndarray>

Dask promotes if the type is large enough.

TypeTracer is designed first and foremost for dask-awkward; it hasn't been used in anything else and it's still a private implementation, anyway. If it acquires another purpose, we can think about generalizing it, but for now, we should think of it as 100% supporting dask-awkward.

Since dask.array departs from NumPy semantics, dask-awkward can, too, and so TypeTracer should do so as well, to support it. Since the NumPy semantics is value-based and any delayed calculation cannot be (without introducing an "unknown dtype"), the only two choices are to forbid the operation—which is what I think you're saying you're implementing—or to depart from NumPy semantics for some but not all of the values in the type. (In the above, the uint16 values from 256 onward are the same as NumPy semantics; the values from 0 to 255 are not.) Since dask.array takes the relaxed option, we should, too.

@agoose77
Collaborator Author

agoose77 commented Oct 31, 2022

Since dask.array departs from NumPy semantics, dask-awkward can, too, and so TypeTracer should do so as well, to support it. Since the NumPy semantics is value-based and any delayed calculation cannot be (without introducing an "unknown dtype"), the only two choices are to forbid the operation—which is what I think you're saying you're implementing—or to depart from NumPy semantics for some but not all of the values in the type. (In the above, the uint16 values from 256 onward are the same as NumPy semantics; the values from 0 to 255 are not.) Since dask.array takes the relaxed option, we should, too.

So, my overarching concern here is for ak.types.Type to be predictable under typetracer, i.e. typetracer should be usable as a mechanism to discover the type. It's my understanding that this is the fundamental purpose of having typetracer. Therefore, I would imagine that we don't want to produce different-typed results for the typetracer vs the non-typetracer branches; otherwise, we lose the predictive value of typetracer. So, to be clear, my goal in this PR is to increase rigor with the aim of making the dask-awkward usage much safer (i.e., not breaking dask-awkward).

To be explicit, this currently isn't the case if we have (or add in future) any code that evaluates nplike.array(...) + nplike.array(...)[0]; the dtype of the result will depend upon the value of the item that is pulled out of the array. This would be a problem if the result of this expression ended up in a NumpyArray.

With respect to the value-dependence of array + scalar (anything that uses np.result_type()), my solution is so-far indeed to ban this (and that's what TypeTracer.result_type() does). If we want the type to be predictable, we either need to ban these operations (and require that the caller specify the dtype), or enforce the results for all of our NumpyLike implementations.
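A minimal sketch of that ban (using this PR's names; the details are assumed):

import numpy

class TypeTracer(NumpyLike):
    def result_type(self, *arrays_and_dtypes):
        # an UnknownScalar has no value, so NumPy's value-based promotion
        # cannot be evaluated predictably: refuse, rather than guess
        if any(isinstance(x, UnknownScalar) for x in arrays_and_dtypes):
            raise TypeError(
                "result_type is not defined for UnknownScalar; "
                "the caller must specify an explicit dtype"
            )
        # everything else is promoted by dtype alone
        return numpy.result_type(
            *[getattr(x, "dtype", x) for x in arrays_and_dtypes]
        )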

Thus far, we may never have relied on this kind of behavior being identical between NumpyLikes in a meaningful way. But anywhere we consume the result of e.g. nplike.add(array, scalar), we would need to cast the result to the correct dtype to ensure that we don't get a runtime error by assuming TypeTracer or NumPy behavior.

The "robust" solution here is to use the new Array API standard: https://numpy.org/neps/nep-0047-array-api-standard.html, which is effectively what NumpyLike is intended to do. Unfortunately, our lower bound for NumPy doesn't support this. That's another point, though (and for now, insufficient as NEP47 does not implement datetime64 or complex support, amongst other things).

@jpivarski I originally wrote that ak.forms.Form should be predictable (i.e. the same form produced whether the nplike is TypeTracer or Numpy), but I walked that back slightly, as I imagine there are places where this isn't true (and I've just introduced one in a1fa4b8 and 59bb711). I assume you agree with my assertion that typetracer is useful for its predictive power; where do you stand on form vs. type predictability? Note that in this discussion I can't concretely picture how dask-awkward uses typetracer, i.e. how exact typetracer needs to be.

@jpivarski
Member

Right, we don't want a TypeTracer Array and a non-TypeTracer Array to return different results, because then the dask-awkward operation would still fail, but it would do so when it's pulling back the results and testing them against the expected Form.

How about if we follow dask.array's choice of promoting dtype by type only (not value), regardless of whether it's a TypeTracer or not? If I remember right, that's probably already the implementation. This would be an example of not adhering strictly to the (current) NumPy API as you've suggested in the past—a user who relies on dtypes being set by np.min_scalar_type() is not making good life-decisions. If anyone complains that this is different from what NumPy does, we can explain the reasons; it's really well justified here.
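For instance (NumPy 1.x), np.result_type() is value-based only when given scalars; given dtypes, it promotes by type alone, which is the dask.array-style behaviour described above:

>>> import numpy as np
>>> np.result_type(np.uint8, np.uint16)
dtype('uint16')
>>> np.result_type(np.uint8, np.uint16(10))
dtype('uint8')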


The two commits that loosen TypeTracer/non-TypeTracer agreement to Type, rather than Form, are something that I've been on the fence about. dask.array only needs to know Types; ensuring that TypeTracers reproduce Forms was a decision to be strict so that we wouldn't need to tighten it (which is much harder than loosening it) if we ever find out that we need Forms. Numba and RDataFrame interfaces need to know Forms, but they can separately compile on each worker, so it's not necessary to give dask-awkward an exact Form.

dask-awkward has been out for a while, and Uproot is using/testing it, though not many users are. At some point, we'll be comfortable enough to say that Dask isn't ever going to need to know exact Forms, and we can then loosen the agreement to Type across the board. It seems to me that a1fa4b8 and 59bb711 are not necessary, so let's hold off on them until we make this decision globally. I think it will probably happen, and it's not API-breaking, so we can do it after users have been working with 2.x for a while.

Anyway, this point about loosening TypeTracer agreement to Type is unrelated to the dtype choice above because dtype is a Type thing.

@agoose77
Collaborator Author

We had a discussion on Slack, and here's a summary:

  • Our nplikes are currently somewhat unstandardised, and this is most notable with the value-dependence of nplikes.Numpy vs nplikes.TypeTracer.
  • We need to eliminate value dependence so that typetracer can reliably predict the result of an operation
  • Even if we don't trigger this right now, it's a footgun-in-waiting
  • We can look to the Array API to define what we should be doing, even if we can't use it directly, given its missing support for complex and datetime types

This PR will therefore walk back 59bb711 and a1fa4b8 which loosen the predictive power of typetracer, and work on standardising the nplikes.

@jpivarski jpivarski added the pr-next-release Required for the next release label Oct 31, 2022
@agoose77 agoose77 mentioned this pull request Nov 2, 2022
@codecov

codecov bot commented Nov 2, 2022

Codecov Report

Merging #1849 (4bd3f61) into main (b83d9dc) will decrease coverage by 0.25%.
The diff coverage is 69.93%.

❗ Current head 4bd3f61 differs from pull request most recent head 9956395. Consider uploading reports for the commit 9956395 to get more accurate results

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/nplikes.py 67.31% <50.00%> (ø)
src/awkward/_typetracer.py 73.31% <66.92%> (-0.85%) ⬇️
src/awkward/_broadcasting.py 93.41% <100.00%> (+4.49%) ⬆️
src/awkward/contents/indexedoptionarray.py 88.94% <100.00%> (+0.39%) ⬆️
src/awkward/contents/listarray.py 89.52% <100.00%> (-0.93%) ⬇️
src/awkward/contents/unmaskedarray.py 66.23% <100.00%> (-6.77%) ⬇️
src/awkward/operations/ak_full_like.py 100.00% <100.00%> (ø)
src/awkward/operations/ak_mean.py 68.00% <100.00%> (ø)
src/awkward/operations/ak_transform.py 65.51% <0.00%> (-25.79%) ⬇️
src/awkward/typing.py 66.66% <0.00%> (-24.25%) ⬇️
... and 147 more

@agoose77 agoose77 force-pushed the agoose77/refactor-typetracer branch from 0c69b73 to a4ea6b0 Compare November 14, 2022 16:27
@agoose77
Collaborator Author

agoose77 commented Nov 15, 2022

This PR looks scary, but it's fairly straightforward so far.

The plan here is still to formalise the NumpyLike type promotion rules, and implement an API that closely resembles the array API. We won't be able to actually invoke the Array API whilst we want to support more types, but we can ensure that we look like the Array API for future ease / API decisions.

Following the Array API is so far a rewarding process; the interfaces are simple, and predictably typed. The area that's currently a bit tricky is working out result dtypes; we need to define our own results here, because the Array API doesn't include complex numbers (draft) or datetime/timedeltas yet. I'm happy that we've got the approximate rules in place, but it's not clear which rules NumPy applies; np.result_type matches our promotion rules, but things like np.stack have special promotion rules. It seems like this is just a same-kind check, which we can do.

It looks like these types are unlikely to be standardised by the Array API any time soon, so we'll need to decide upon this for ourselves. Note that we don't need to implement what NumPy does here; we just need to choose something and stick to it. Ideally, we'll choose something that's a subset of NumPy's behavior so that we can directly use the corresponding methods. Again, this is an internal part of Awkward, so we're able to impose these kinds of constraints upon ourselves.

NumPy uses these rules for concat / stack: https://github.com/numpy/numpy/blob/fbe1b65f5f66ca5a416b23194577a1ed5a4cf1bd/numpy/core/src/multiarray/datetime.c#L1668-L1701
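As an illustration of what those rules do, NumPy promotes concatenated datetimes to the finer time unit:

>>> import numpy as np
>>> a = np.array(["2021-01-01"], dtype="datetime64[D]")
>>> b = np.array(["2021-01-01T12:00"], dtype="datetime64[m]")
>>> np.concatenate([a, b]).dtype
dtype('<M8[m]')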

Re: complex types, we currently allow float-complex mixing, as it can be a lossless conversion.

I'll give this some more thought.

@jpivarski
Member

Yes, complex types seem pretty straightforward, since they are exact supersets of the corresponding floating point types.

For date-time types, we can refuse every type of mixing except for the few mathematical operations that are defined (with commutative symmetry for +):

  • datetime64 - datetime64 → timedelta64
  • datetime64 + timedelta64 → datetime64
  • datetime64 - timedelta64 → datetime64
  • number * timedelta64 → timedelta64
  • timedelta64 / number → timedelta64
  • timedelta64 + timedelta64 → timedelta64
  • timedelta64 - timedelta64 → timedelta64

I don't think there are any other mathematical operations that apply to date-time types.
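A few spot checks of these rules against NumPy (illustrative):

>>> import numpy as np
>>> np.datetime64("2021-01-02") - np.datetime64("2021-01-01")
numpy.timedelta64(1,'D')
>>> np.timedelta64(4, "s") / 2
numpy.timedelta64(2,'s')
>>> 2 / np.timedelta64(4, "s")
Traceback (most recent call last):
  ...
TypeError: ufunc 'true_divide' not supported for the input types, ...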

I looked at the NumPy reference (and also found this), and was drawn to the comment about "nonlinear time units." I wondered what they meant by that. It's the fact that NumPy accepts time units like "months" and "years," which can have a variable number of days, depending on their absolute time position. THAT is a huge can of worms. We should avoid ever computing

  • datetime64 + timedelta64 → datetime64
  • datetime64 - timedelta64 → datetime64

ourselves—let NumPy deal with the possibility that timedelta64 is expressed in months or years, and then the number of days that get added to the datetime64 depends on the value of the datetime64.

I think we do let NumPy handle all of this, since the above are ufuncs, and reducers like sum and prod don't mix arrays, so our reducers don't see mixed units.

You asked about cases like concatenate, not mathematical functions. We should be able to get NumPy to cast them to common units, right?

@agoose77
Collaborator Author

Whilst I remember, with respect to the type inspection interface, the standard doesn't yet implement everything, but I'm planning to take this as inspiration: data-apis/array-api#425

Yes, complex types seem pretty straightforward, since they are exact supersets of the corresponding floating point types.

The only concern I have is that we want to anticipate what the spec does so that we don't deviate unnecessarily. In the long term, there may be a point where the Array API supports complex types and datetimes (I suspect a long, long way away), and at that point most of our nplike would become pass-throughs. In particular, the Array API doesn't mix int and float types. There seems to be a lean towards allowing complex and floats to mix, I imagine because the result is obvious (float64 → complex128).

You asked about cases like concatenate, not mathematical functions.

Yes, this is special because concat in the Array API applies type promotion rules, whilst NumPy usually has a special concat result-type table. This is only relevant for the timelike types, which are somewhat special; they have distinct concepts of promotion and result types: a timedelta64 can be added to a datetime64 (subject to kind constraints), but cannot be converted to a datetime64. You also note the non-commutative-type operations like /.

@agoose77
Collaborator Author

agoose77 commented Nov 15, 2022

@jpivarski this reply is turning into a long post, so I'll give it some structure. I'll probably keep editing this, so best check the web version only.

The Discussion so Far

I've also come to realise that this is a bigger can of worms than I had first anticipated.

Let me summarise the worm-can here for posterity:

  • TypeTracer needs to be able to predict the result (types, shapes*) of nplike operations
  • NumPy et al. implement some data-dependent dtype promotion (min_scalar_type) (see #1849 (comment)), and all array libraries try to replicate NumPy at some level of formality (i.e. without a standard)
  • The Array API solves this by formalising type-only promotion semantics for a well defined set of operations
  • The Array API doesn't cover the scope of NumPy that we need e.g. complex numbers, datetimes

How we use NumPy / NumpyLike

Any which way we cut it, this is a tricky problem. A significant component of this is that we use NumPy [1] / NumpyLike in several different ways:

  • internally to operate on indices, NumpyLike
  • internally for simple reductions (this will be curtailed with refactor: simplify reducer API #1793), NumpyLike
  • externally via ufuncs (ak.Array(...) + 1), NumPy
    • the __array_ufunc__ mechanism ultimately dispatches to NumPy.
  • externally via __array_function__ for non-overloaded functions on regular Awkward arrays, NumPy
    • the __array_function__ interface tries to cast the array to a NumPy array if there is no Awkward implementation (IIRC). This then uses NumPy or CuPy, etc.

The Problem

I was originally adopting the Array API-inspired interface for NumpyLike because it solves type-only promotion (for a subset of types). However, whilst we can do whatever we like internally (users won't see NumpyLike), externally we have user expectations. Because we use NumPy/CuPy under the hood, we currently have value-based promotion at the user level with scalars:

>>> import awkward as ak
>>> import numpy as np
>>> x = ak.from_numpy(np.arange(10, dtype=np.int8))
>>> ak.type(x)
ArrayType(NumpyType('int8'), 10)
>>> ak.type(x + np.int64(0))
ArrayType(NumpyType('int8'), 10)
>>> ak.type(x + np.int64(2**(32-1) - 1))
ArrayType(NumpyType('int32'), 10)

This is mediated by the ufunc mechanism, so it is not something that we can directly fix by patching nplike. If we want dask-awkward to support dak.Array(...) + dak.Array(...)[0], then we need to remove value-based promotion from the ufunc mechanism.

The value-agnostic result would be

>>> import awkward as ak
>>> import numpy as np
>>> x = ak.from_numpy(np.arange(10, dtype=np.int8))
>>> ak.type(x)
ArrayType(NumpyType('int8'), 10)
>>> ak.type(x + np.int64(0))
ArrayType(NumpyType('int64'), 10)
>>> ak.type(x + np.int64(2**(32-1) - 1))
ArrayType(NumpyType('int64'), 10)

which obviously differs from the NumPy equivalent (note that x is a non-ragged array)

>>> import numpy as np
>>> x = np.arange(10, dtype=np.int8)
>>> x.dtype
dtype("int8")
>>> (x + np.int64(0)).dtype
dtype("int8")
>>> (x + np.int64(2**(32-1) - 1)).dtype
dtype("int32")

So, this PR has morphed into a question of formalising our public-facing semantics in relation to NumPy for dask-awkward.

Possible Solution

There are several partially independent axes at play here:

  1. Adopting a more Array API-like interface for ease of development (i.e. don't support all the crazy options at the NumpyLike level) (internal)
  2. Removing value-based type promotion (user-visible, internal)
  3. Making TypeTracer predictable (internal)

Ultimately we want to address (2), because it affects the user as well as our internals, and therefore is a constraint.

I think @jpivarski that we do want valueless promotion semantics. My assumption is that we're okay to deviate from NumPy at the user-visible level if it's well thought out and well-defined. We already have some smaller differences, e.g. our np.sum(x, axis=-1) produces different results for large N, small M arrays of shape (N, M) due to pairwise summation.

So, we do need to ensure that ufuncs don't perform the wrong cast. We can do this in two ways:

  1. only correct NumPy for "wrong" cases
  2. implement our own formal semantics that we define
     • extend the Array API to handle complex types, datetimes, mixed-type promotion

(2) Use the Array API

The Array API provides type-only promotion, but we would need to augment it: the Array API does not define promotion between different kinds, e.g. int and float, so 5 + 1.0 would not be defined by the canonical Array API. Of course, neither are complex numbers or datetimes, which we need to support. I would start by saying that we do want mixed-type promotion; users expect int + float to produce a float, and it is predictable (albeit lossy for integers whose magnitude lies above 2^53 when cast to doubles). The Array API does allow additional promotion rules:

Boolean, integer and floating-point dtypes are not connected, indicating mixed-kind promotion is undefined.
[...]
A conforming implementation of the array API standard may support additional type promotion rules beyond those described in this specification.
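For example, NumPy's experimental implementation of the standard refuses mixed-kind promotion outright (behaviour of numpy.array_api at the time of writing; shown for illustration):

>>> import numpy.array_api as xp
<stdin>:1: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
>>> xp.asarray([5]) + xp.asarray([1.0])
Traceback (most recent call last):
  ...
TypeError: int64 and float64 cannot be type promoted together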

Implementing the Solution

If we are willing to make this formalisation, I think the way through this is:

  • implement a custom Array object as required by the Array API. This will delegate to the NumpyLike to provide the implementation for operations like array + other_array (where array is an internal array, not an ak.Array). We can then re-use this across NumpyLikes. This will ensure that we can trust operations on the array object e.g. x + y.
  • define promotion rules for complex numbers, and datetimes. Follow NumPy where possible.
  • define mixed-type promotion rules for integers, floats, complex numbers, etc. Follow NumPy where possible.
  • implement additional mechanism for non-commutative operations (in the sense that timedelta64 / integer is allowed, integer / timedelta64 is not).
  • apply the awkward promotion rules to ufunc arguments before calling the ufunc, so that the result has the correct type (see the sketch after this list). If we are worried about overzealous promotion, we could check the ufunc typecodes first and do a minimum-cost conversion (e.g. find the casts that cost the fewest bytes of copying, which is not guaranteed "lowest cost", but is trivial).
    • or, drop the ufunc implementation entirely and just use them to look up our own kernels. This wouldn't work for JAX, which rules it out for me.
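A minimal sketch of that pre-promotion step (a hypothetical wrapper, not this PR's implementation; it ignores the non-commutative timelike cases discussed above):

import numpy as np

def apply_ufunc_valueless(ufunc, *args):
    # compute the result dtype from dtypes alone, ignoring scalar values,
    # then cast the inputs so the ufunc cannot apply value-based promotion
    result_dtype = np.result_type(*[np.asarray(arg).dtype for arg in args])
    return ufunc(*[np.asarray(arg).astype(result_dtype) for arg in args])

>>> apply_ufunc_valueless(np.add, np.array([1, 2, 3], np.uint8), np.uint16(10)).dtype
dtype('uint16')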

In all, I'm not 100% sure what the right call is. Each solution has drawbacks, and this is a hard problem. It feels like we're looking at maintaining a not-insignificant amount of code, but I think this is a symptom of the problem and not the solution. That said, I've been thinking about this for perhaps too long, so I'd appreciate your thoughts at some point, @jpivarski.

Footnotes

  [1] or CuPy, JAX

@jpivarski
Member

I think @jpivarski that we do want valueless promotion semantics. My assumption is that we're okay to deviate from NumPy at the user-visible level if it's well thought out and well-defined.

Yes and yes: we absolutely want valueless promotion semantics, and we want it to be forward-looking, to agree with the Array API instead of NumPy. Eventually, NumPy will adhere to Array API by default, so this is a temporary situation.

Also, as you point out, we are already okay with some small differences from NumPy when they're well motivated.

How to implement it without introducing a lot of code to maintain: I hope we'll be able to create empty arrays, apply the operation to them, and read the dtype off the resulting (also empty) array.
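A helper along those lines (hypothetical name, shown as a sketch):

import numpy as np

def predict_dtype(op, *dtypes):
    # zero-length "canary" arrays carry no values, so any promotion NumPy
    # performs on them is necessarily type-only
    canaries = [np.empty(0, dtype=dt) for dt in dtypes]
    return op(*canaries).dtype

>>> predict_dtype(np.add, np.uint8, np.uint16)
dtype('uint16')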

As a reminder, this was the problem:

>>> import numpy as np, dask.array as da
>>> np.array([1, 2, 3], np.uint8) + np.uint16(10)
array([11, 12, 13], dtype=uint8)
>>> np.array([1, 2, 3], np.uint8) + np.uint16(300)
array([301, 302, 303], dtype=uint16)

Suppose that we pull in np.array_api to try to solve it?

>>> import numpy as np
>>> import numpy.array_api
<stdin>:1: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
>>> np.array_api.empty(0, dtype=np.array_api.uint8) + np.array_api.empty(0, dtype=np.array_api.uint16)
empty((0,), dtype=uint16)

I think we're currently using empty arrays as canaries (canary typing is some corollary to duck typing?), and when we don't put values in those arrays, even the current NumPy API does the above:

>>> np.empty(0, dtype=np.uint8) + np.empty(0, dtype=np.uint16)
array([], dtype=uint16)

So maybe we can't just pass arrays (with values) to NumPy and assume that it will do the right type propagation, but we can use empty arrays to predict a type, then apply it to the result, right? If need be, we can use np.array_api to do that (and hide the warning?), though at the moment, I don't see any examples in which it's needed.

@agoose77
Collaborator Author

agoose77 commented Nov 17, 2022

So far I've formalised the NumpyLike API, which looks more like the Array API, but doesn't make as many guarantees about type promotion.

After our conversation on NumpyLike (@jpivarski), I set about building TypeTracer on top of arrays. It became apparent to me that the type-level separation of UnknownScalar, UnknownLength, and TypeTracerArray is somewhat cumbersome. Each class has its own rules about which operations map between the different types. This means that, from a type-hinting perspective, we'd need to implement most of the signatures three times.

I realised that the most obvious simplification is that UnknownScalar is just a TypeTracerArray with shape (). This would halve the amount of work / code to maintain / edges to reason about. The "scalar" part then becomes just a runtime property (the shape). This reflects what can happen when working with concrete NumPy; sometimes we have 0D arrays that act like scalars.

Then I thought about UnknownLength, which is really where the complexity remains; some operations on UnknownLength lose the "length" trait, so we need to define a whole set of signatures for these. I realised that we could move this to a runtime trait, too. I went for a simple "traits" mechanism that allows us to tag an array with a set of string traits. We can then define a table of which operations preserve/lose these traits, with the rule that a binary operation preserves a trait if and only if both operands have the trait and the operation is known to preserve it too.

In code, this looks like

import enum
import operator


class TypeTracerTraits(str, enum.Enum):
    POSITIVE = "POSITIVE"


# Don't preserve positivity between `positive` arrays under these operations
TypeTracerArray.drop_trait_for(
    TypeTracerTraits.POSITIVE,
    operator.sub,
    operator.inv,
    operator.neg,
    operator.ge,
    operator.le,
    operator.gt,
    operator.lt,
    operator.eq,
    operator.ne,
)


# Allow non-TypeTracerArray scalars to automatically gain the positive trait, e.g. in `x + 1`
@TypeTracerArray.initialises_scalar_trait
def _initialise_scalar_trait(value):
    if isinstance(value, (int, float)) and value >= 0:
        return TypeTracerTraits.POSITIVE

This is the current WIP: https://github.com/scikit-hep/awkward/blob/agoose77/refactor-typetracer/src/awkward/_nplikes/typetracer.py

@jpivarski how do you feel about this code? Are you comfortable, firstly, with merging the scalar and array types and, secondly, with moving the "length"-ness to a runtime trait?

@jpivarski
Member

Making UnknownScalar a TypeTracerArray with shape () is a good simplification. It doesn't conflate any currently distinct entities because NumpyArray always has at least one dimension (so that every Content has a length). The name UnknownScalar should stick around (as a synonym), or else we'll have to coordinate releases with dask-awkward, and now is not the time to do that because @douglasdavis is busy responding to @lgray's feedback.

Making UnknownLength a special case of UnknownScalar is also a good simplification, and it's not necessary to retain knowledge of it being non-negative. ("Positive" is not a sufficiently precise word, by the way.) The non-negativeness of UnknownLength is not used anywhere that I remember, and I doubt it matters. When creating tokens to represent missing information, there's always a question of how much information to let go.

I think you can get away with just assigning UnknownLength = UnknownScalar(np.int64) and call it a day.
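A sketch of what that merge might look like (a highly simplified stand-in for this PR's actual classes):

import numpy as np

class TypeTracerArray:
    def __init__(self, dtype, shape):
        self.dtype = np.dtype(dtype)
        self.shape = shape

def UnknownScalar(dtype):
    # "scalar" becomes a runtime property: shape == ()
    return TypeTracerArray(dtype, shape=())

# keep the old name as a synonym so dask-awkward keeps working
UnknownLength = UnknownScalar(np.int64)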

@agoose77 agoose77 force-pushed the agoose77/refactor-typetracer branch from 56837d9 to 97a2589 Compare November 18, 2022 14:30
@agoose77 agoose77 force-pushed the agoose77/refactor-typetracer branch from 97a2589 to fba011f Compare November 18, 2022 15:12
@agoose77
Collaborator Author

@jpivarski what does

"Positive" is not a sufficiently precise word, by the way

mean? Are you referring to the concept that zero is neither positive nor negative? I am using the CS interpretation here: that sign exists independently of the magnitude :)

@jpivarski
Member

Using

  • "positive" to mean x > 0
  • "negative" to mean x < 0
  • "non-negative" to mean x >= 0
  • and (more rarely) "non-positive" to mean x <= 0

is standard usage; I haven't noticed any deviation from this in CS.

My absolute favorite use of precise language in documentation is Java's core library docs, which define Math.PI as

The double value that is closer than any other to pi, the ratio of the circumference of a circle to its diameter.

Since the irrational number $\pi$ is between two rational floating point values, the documentation even specifies which one: the closer one. (I don't know offhand whether that's the one that's above $\pi$ or the one that's below $\pi$.)

Floating point numbers have a separate sign bit, such that 0.0 is a different bit-pattern than -0.0:

>>> np.array([0.0]).tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.array([-0.0]).tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x80'

but these two values of zero are equal to each other; there is one zero value associated with two bit patterns. (There are many distinct bit patterns for the NaN value.)

However, integers (UnknownLength is an integer) do not have a single sign bit. It is the case that the highest bit is only ever 1 when the integer value is negative, but the two's complement formalism affects all the bits when a number goes negative:

>>> np.array([0]).tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.array([1]).tobytes()
b'\x01\x00\x00\x00\x00\x00\x00\x00'
>>> np.array([-1]).tobytes()
b'\xff\xff\xff\xff\xff\xff\xff\xff'

@agoose77
Collaborator Author

Floating point numbers have a separate sign bit, such that 0.0 is a different bit-pattern than -0.0:

Right, that's what I'm referring to. One of the most amusing classes I demonstrated in Physics involved discovering how people think floats work, vs how they actually work!

The double value that is closer than any other to pi, the ratio of the circumference of a circle to its diameter.

I audibly nose-exhaled at this.

However, integers (UnknownLength is an integer) do not have a single sign bit

Hmm, we might be crossing wires; I'm talking about UnknownLength being >= +0, which I described as +ve. I mention the sign bit in the sense that there are distinct zeros which are defined to be equal (unless you're taking the reciprocal). I didn't mean for the description to go further than that.

Anyway 🚂, let's go for non-negative. That suits me fine!

@agoose77 agoose77 force-pushed the agoose77/refactor-typetracer branch from e405d32 to 5af3605 Compare December 1, 2022 14:20
@agoose77
Collaborator Author

agoose77 commented Dec 1, 2022

WIP: add an API to smooth over typetracer changes

@agoose77 agoose77 removed the pr-next-release Required for the next release label Dec 2, 2022
@agoose77
Collaborator Author

agoose77 commented Dec 8, 2022

Something for me to think about here is what NumpyArray.__array__ and NumpyArray.data should return. The former should probably return a raw NumPy array (rather than our custom object), because that's what NumPy's array API does. Or perhaps we should not let this succeed by default, and instead require the user to evaluate np.asarray(layout.data). This would allow our nplikes to implement their own __array__ (or CUDA variant), rather than having them on NumpyArray.

@agoose77
Collaborator Author

agoose77 commented Jan 4, 2023

Jim and I had a conversation about this. Since #2020, #1928, etc., we have much less need to formally build out our array abstraction. At the same time, the challenges with this PR (the large amount of new code, and the need to use our own array object) are still present. I've decided to reduce the scope of this PR; instead, we will prohibit operations on arrays that do not proceed via the nplike API, and implement the appropriate tests there such that we do not need a special array object. This can be accompanied by a runtime test to ensure we are only using nplike, or a mypy-level type check.

@agoose77
Collaborator Author

Closing to ... once again ... create a new PR.

@agoose77 agoose77 closed this Jan 11, 2023
@agoose77 agoose77 deleted the agoose77/refactor-typetracer branch April 11, 2023 21:34
Development

Successfully merging this pull request may close these issues.

Implement all ufuncs in TypeTracerArray (was: ak.num(array, axis=1))