In my thesis, I would like to provide official and practical typing support for
`numpy` and related linear algebra libraries. Mainly, I am interested in
preventing dimensionality errors that can occur during common matrix operations,
including dot products, matrix multiplication, and concatenation.

In this Markdown file, I'll detail my progress over the course of the semester on my thesis project. My work proceeded more or less sequentially as follows:

- Research the common `numpy` operators and formally detail their behavior.
- Reach out to the maintainers of `mypy` and understand the typing primitives that are currently available and planned.
- Identify potential shortcomings of the current primitives, then develop and submit a proposal for new primitives and operators for `mypy`.
- Receive and digest feedback from the `mypy` maintainers, and propose final plans for contribution to the `mypy` module.
- Make a first contribution to the `mypy` codebase.
- Research other potential methods of typing Python, drawing inspiration from Liquid Haskell.
- Plan contributions for next semester as well as further research interests.
The fundamental unit of interest in `numpy` (or `np`) is the `np.array`
object, which is the primary matrix/data object. The `np.array` can hold an
array of numbers, and its shape is accessible via the `.shape` attribute. To
get a better understanding of how the `.shape` attribute works, we can go over
some examples:

- Suppose we have a simple array of 12 integers. This is an `np.array` with
  `.shape == (12,)`:

  ```python
  x = np.array(list(range(12)))
  print(x.shape)
  >>> (12,)
  ```
- Suppose we have a dataset of 12000 points which each have two features. The
  shape of this `np.array` is `(12000, 2)`.
- Suppose we have a dataset of 12000 images which are each 3x3 in size. The
  shape of this `np.array` is `(12000, 3, 3)`.
- Even scalars can be considered `np.array` types, with `.shape` being the
  empty tuple.
So in general, the length of the `.shape` tuple indicates the number of axes
that the `np.array` has, and the number in each tuple position indicates the
length of that particular axis.
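
To make the examples above concrete, here they are as one short snippet (using
`np.zeros` simply to construct arrays of the right shapes):

```python
import numpy as np

x = np.array(list(range(12)))   # shape (12,)
data = np.zeros((12000, 2))     # 12000 points with 2 features each
imgs = np.zeros((12000, 3, 3))  # 12000 images, each 3x3
s = np.array(5.0)               # a scalar: shape ()

print(x.shape, data.shape, imgs.shape, s.shape)
# (12,) (12000, 2) (12000, 3, 3) ()
```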
Now, we can go over some of the most common operations on `numpy` arrays. We
can specify them by writing the type of the inputs, requirements on the
relationship between the inputs, and the shape of the output. To help make
specifying the relationships easier, we'll use some Python pseudocode at some
points.
- broadcasting - this happens when we perform traditional infix operations
  between two `np.array` objects, like `*`, `/`, `**`, etc., or between an
  `np.array` object and something that is array-like (mainly just scalars that
  haven't already been converted to `np.array`s). See the example after this
  list.
  - input - two arrays `A` and `B`
  - require

    ```python
    n, m = len(A.shape), len(B.shape)
    z = min(n, m)
    for i in range(z):
        assert (
            A.shape[n - 1 - i] == B.shape[m - 1 - i]
            or A.shape[n - 1 - i] == 1
            or B.shape[m - 1 - i] == 1
        )
    ```

    This essentially just reverses the `.shape` attributes of both arrays and
    then zips them together. It then checks that each zipped pair of elements
    is equal or that one of the elements is equal to 1.
  - output - single array `C` where
    `len(C.shape) == max(len(A.shape), len(B.shape))` and, aligning the shapes
    from the right, `C_i == max(A_i, B_i)` in each position (a missing leading
    axis is treated as having length 1)
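
As a quick sanity check of this rule, a small example (shapes chosen
arbitrarily):

```python
import numpy as np

# Shapes are aligned from the right; axes of length 1 stretch to match.
A = np.ones((12000, 3, 1))
B = np.ones((3, 5))
C = A * B
print(C.shape)  # (12000, 3, 5)
```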
- `np.dot` is the union of multiple function types. Depending on the shape of
  the `np.array`s it works on, it has multiple behaviors. `np.dot` is
  essentially the very overloaded "multiplication" operator in `numpy`. It
  takes the dot product as advertised, but when handed 2-D matrices, it
  performs matrix multiplication as a convenience. (An example covering all
  five behaviors follows this list.)
  - Actual dot product
    - input - two arrays `A` and `B` where `len(A.shape) == 1` and `len(B.shape) == 1`
    - require - `assert A.shape[0] == B.shape[0]`
    - output - array `C` where `len(C.shape) == 0`. In other words, `C` is a scalar.
  - Matrix multiplication
    - input - two arrays `A` and `B` where `len(A.shape) == 2` and `len(B.shape) == 2`
    - require - `assert A.shape[1] == B.shape[0]`
    - output - array `C` where `len(C.shape) == 2`, `C.shape[0] == A.shape[0]`, and `C.shape[1] == B.shape[1]`
  - Scalar multiplication
    - input - two arrays `A` and `B` where either `len(A.shape) == 0` or `len(B.shape) == 0`. Without loss of generality, assume `len(A.shape) == 0`.
    - require - nothing
    - output - array `C` where `C.shape == B.shape`
  - Generalized dot product
    - input - two arrays `A` and `B` where `len(B.shape) == 1`
    - require - `assert A.shape[-1] == B.shape[0]`
    - output - array `C` where `len(C.shape) == len(A.shape) - 1` and `C.shape[i] == A.shape[i]` for `i in range(len(C.shape))`
  - Generalized multiplication
    - input - two arrays `A` and `B` whose dimensions meet none of the criteria above
    - require - `assert A.shape[-1] == B.shape[-2]`
    - output - array `C` where `C.shape == A.shape[:-1] + B.shape[:-2] + B.shape[-1:]`
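
A quick illustration of the five behaviors above (shapes chosen arbitrarily):

```python
import numpy as np

v = np.ones(3)
M = np.ones((4, 3))
T = np.ones((2, 4, 3))

print(np.dot(v, v))                         # 1-D x 1-D: the scalar 3.0
print(np.dot(M, np.ones((3, 5))).shape)     # 2-D x 2-D: (4, 5)
print(np.dot(2.0, M).shape)                 # scalar case: (4, 3)
print(np.dot(M, v).shape)                   # generalized dot product: (4,)
print(np.dot(T, np.ones((5, 3, 6))).shape)  # generalized multiplication: (2, 4, 5, 6)
```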
- `np.sum` takes the sum of the passed-in `np.array`. Interestingly, it has an
  optional `axis` argument which can be used to specify the axis or axes along
  which the user would like to sum.
  - input - array `A` and optional `axis`, which may be an `int` or a `tuple`
  - require

    ```python
    assert isinstance(axis, int) or isinstance(axis, tuple)
    if isinstance(axis, int):
        assert axis < len(A.shape)
    else:
        assert len(set(axis)) == len(axis)  # assure no duplicates
        for ax in axis:
            assert ax < len(A.shape)
    ```

  - output - array `B` where all the provided axes are removed from `A.shape`.
    If no `axis` was provided, then `B.shape == ()`, i.e. `B` is a scalar.
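
For example:

```python
import numpy as np

A = np.ones((2, 3, 4))
print(np.sum(A).shape)               # (): no axis given, so the result is a scalar
print(np.sum(A, axis=0).shape)       # (3, 4)
print(np.sum(A, axis=(0, 2)).shape)  # (3,)
```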
- `np.linalg.norm` behaves identically to `np.sum` shape-wise, except that it
  computes the norm along a particular axis; the shape of the result is the
  same.
- `np.zeros` and `np.ones` behave identically shape-wise, although the first
  creates an array of zeros and the other an array of ones.
  - input - tuple of integers `x`
  - require - nothing
  - output - array `A` where `A.shape == x`
- `np.concatenate` concatenates a list of arrays together along an axis.
  - input - list of `np.array`s `As`; optionally provide `axis` with a default value of `0`
  - require

    ```python
    n = len(As[0].shape)
    for A in As:
        assert len(A.shape) == n
    assert axis < n
    # all of the arrays must agree in every dimension except the specified axis
    for i in range(n):
        if i != axis:
            assert all(A.shape[i] == As[0].shape[i] for A in As)
    ```

  - output - array `C` with `len(C.shape) == len(As[0].shape)`,
    `C.shape[i] == As[0].shape[i]` for `i != axis`, and
    `C.shape[axis] == sum(A.shape[axis] for A in As)`.
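
For example:

```python
import numpy as np

A = np.ones((2, 5))
B = np.ones((3, 5))
C = np.concatenate([A, B], axis=0)
print(C.shape)  # (5, 5) -- lengths along axis 0 sum; all other axes must match
```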
After reaching out to the Python community, the original proposal for `numpy`
typing was shared with me. The new type object is called a "variadic generic",
and this in combination with the existing `Literal` and `TypeVar` types is
proposed to solve the `numpy` typing problem.

Instantiate with

```python
Ts = TypeVar('Ts', variadic=True)
```

It can be used in variadic type constructors such as `Tuple`, as in

```python
Tuple[int, Expand[Ts]]
```

or in the case of a potential `numpy` type `Shape`:

```python
Shape[Ts]
```
Suppose we want to type an operation that drops the first axis of an array. We
define a new `Shape` variadic type as well as `IntVar` to create type
variables that stand for integer dimensions. We might type this as follows:

```python
N = IntVar('N')
Ns = IntVar('Ns', variadic=True)

def drop_first(A: Shape[N, Ns]) -> Shape[Ns]:
    ...
```
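
For intuition, a runtime `numpy` operation with exactly this shape behavior
(indexing away the first axis):

```python
import numpy as np

A = np.ones((12000, 3, 3))
B = A[0]        # drops the first axis
print(B.shape)  # (3, 3)
```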
For some of the functions we mentioned above, we can use the `@overload`
directive in `mypy` in conjunction with the `Literal` type constructor. For
`np.mean`, we might have

```python
@overload
def mean(array: Shape[N, Ns],
         axis: Literal[0]) -> Shape[Ns]: ...
@overload
def mean(array: Shape[N, M, Ns],
         axis: Literal[1]) -> Shape[N, Ns]: ...
@overload
def mean(array: Shape[Ns, N],
         axis: Literal[-1]) -> Shape[Ns]: ...
```
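
For reference, the runtime behavior these stubs are trying to capture:

```python
import numpy as np

A = np.ones((12000, 3, 3))
print(np.mean(A, axis=0).shape)   # (3, 3)
print(np.mean(A, axis=1).shape)   # (12000, 3)
print(np.mean(A, axis=-1).shape)  # (12000, 3)
```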
Note: I shared a modified version of the following with the maintainers of
`mypy` to get their thoughts. I'll cover their response in the next section.
I think the variadic types are very flexible, and in general, I think they
will be very useful for defining the types of many `numpy` operators. However,
I have some concerns about functions like `np.mean`. In the current proposal
(as far as I can tell), they are implemented as

```python
@overload
def mean(array: Shape[N, Ns],
         axis: Literal[0]) -> Shape[Ns]: ...
@overload
def mean(array: Shape[N, M, Ns],
         axis: Literal[1]) -> Shape[N, Ns]: ...
@overload
def mean(array: Shape[Ns, N],
         axis: Literal[-1]) -> Shape[Ns]: ...
```

which is correct, but also extremely inflexible. Even if we were to generate
hundreds of these stubs, we still would be unable to completely capture all
possible uses of the `mean` function for `numpy` arrays with hundreds of axes.
Although such a situation is hard to imagine, it is not impossible, and it
would be preferable to cover every correct use. This also does not even cover
the possibility that `np.mean` can accept a tuple argument as `axis`, like

```python
@overload
def mean(array: Shape[M, N, Ns],
         axis: Tuple[Literal[0], Literal[1]]) -> Shape[Ns]: ...
```

which would lead to an even greater explosion of stubs to be generated.
Instead, I think we should aim for a more general type definition, one that
can adequately capture all use cases simultaneously. In the case of `mean`, we
can target something like

```python
@overload
def mean(array: Shape[Ms, N, Ns],
         axis: K) -> Shape[Ms, Ns]: ...
```

However, missing from this type definition is the information that `K`
specifies the particular location of `N` to remove from the resulting shape.
And even if that information were to somehow be specified in the types, we
would still be at a loss for the case where `axis` is a tuple of integers.
Something else that I've seen proposed is the idea of a special `Drop`
operator, which would result in a type that looks like the following:

```python
@overload
def mean(array: Shape[Ms],
         axis: K) -> Shape[Drop[Ms, K]]: ...
@overload
def mean(array: Shape[Ms],
         axis: Tuple[Ks]) -> Shape[Drop[Ms, Ks]]: ...
```

This is attractive because it gives us the ability to fully specify the type
signature even for the tuple case. However, it does feel very "special-cased"
and might not extend well to the signatures of other methods in `numpy` or to
the signatures of methods that people themselves want to define. Also, as an
aside, the variadic `Ks` in the second type definition above does not capture
the fact that all the `Ks` must be distinct within the `axis` tuple.
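
For intuition, the runtime behavior that `Drop[Ms, Ks]` is meant to describe:

```python
import numpy as np

A = np.ones((2, 3, 4, 5))
print(np.mean(A, axis=1).shape)       # (2, 4, 5): axis 1 dropped
print(np.mean(A, axis=(0, 2)).shape)  # (3, 5): axes 0 and 2 dropped
```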
As another case study, I would like to look at typing the function
`np.concatenate`, which I feel is fairly common in `numpy` use cases. Again,
we want to avoid generating stubs for specific axis arguments for
`np.concatenate`, so we can turn to the other two methods mentioned above. As
a first attempt, we might write

```python
def concatenate(arrs: List[Shape[Ms, N, Ns]],
                axis: K) -> Shape[Ms, Sum[N], Ns]: ...
```
This captures the fact that all the arrays must have the same number of axes
and that those axes must align in all but one position. But this obviously has
a lot of problems. Again, we run into the issue of specifying that `K` must be
the same as the index of the `N` in the list of arrays. This type declaration
also seems to imply that `N` must be the same across all the arrays, when in
fact it is allowed to differ between the arrays. And finally, we rely on a
`Sum` construct which doesn't actually exist.

The other approach doesn't seem very promising either; it's unclear that
`Drop` will help here, and neither would an `Index` construct.
What the `concatenate` example seems to indicate is that our type system isn't
flexible enough to capture all the operations that we might be interested in
defining within the `numpy` library. If we wanted full flexibility, we might
imagine a system where we can define a function that receives the inferred
types of a function's inputs and simply computes and returns the output type.
Such a system is tested in demo.py, where we define in Python the type
signatures for several functions. For example, the `concatenate` function type
might be written as
```python
def concatenate_typ(arr_typs, axis):
    # all of the input types must have the same number of axes
    if any(len(arr_typ) != len(arr_typs[0]) for arr_typ in arr_typs):
        return None
    final_type = []
    for i, axes in enumerate(zip(*arr_typs)):
        if i == axis:
            # the concatenation axis gets the sum of the input lengths
            final_type.append(sum(axes))
            continue
        # every other axis must agree across all of the inputs
        if any(ax != axes[0] for ax in axes):
            return None
        final_type.append(axes[0])
    return final_type
```
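
For example, running this "type function" on a couple of shape tuples:

```python
print(concatenate_typ([(2, 5), (3, 5)], axis=0))  # [5, 5]
print(concatenate_typ([(2, 5), (3, 4)], axis=0))  # None: axis 1 disagrees
```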
This approach clearly has a lot of shortcomings. It's extremely verbose: a
nontrivial amount of code needs to be written just for this one "type". It
would require exposing `mypy` internals so that users would be able to access
the `mypy`-inferred types to make judgements about the final type. It would
also require `mypy` to execute arbitrary user code, which is not desirable.
Despite all these shortcomings, it is (so far) the only method that seems able
to adequately capture the full type information of `concatenate`.
We need to find a middle ground here that gives us more expressivity without the cost of verbosity. I'm going to suggest one possibility here, but I'm definitely open to suggestions and/or feedback!
One possibility to solve this issue is to hybridize the two variadic methods
above and introduce `@requires` and `@ensures` decorators on function
signatures, where we can specify "contracts" about the types of the arguments.
This gives us the flexibility to enforce further constraints on types that
can't be specified by the syntax of the type alone. As an example with `mean`:

```python
@overload
@requires(axis == IndexOf[N])
def mean(arr: Shape[Ms, N, Ns],
         axis: K) -> Shape[Ms, Ns]: ...
```
I recognize that this is a little wonky, since the `@requires` refers to
`axis`, which isn't yet bound within the scope of the program. We could fix
this by making the `@requires` take a lambda:

```python
@overload
@requires(lambda _, axis: axis == IndexOf[N])
def mean(arr: Shape[Ms, N, Ns],
         axis: K) -> Shape[Ms, Ns]: ...
```

If that still doesn't seem satisfactory, we can always iterate to improve on
it! There is also the construct of `IndexOf`, which doesn't yet exist within
the `mypy` system.
For the case of `mean` with a `Tuple` axis argument, we might write

```python
@overload
@requires(lambda axis: all(ax1 != ax2 for i, ax1 in enumerate(axis) for ax2 in axis[i + 1:]))  # uniqueness property
def mean(arr: Shape[Ms],
         axis: Tuple[Ks]) -> Shape[Drop[Ms, Ks]]: ...
```
For the `concatenate` operator, we'd need several clauses:

```python
@requires(all(len(arr) == len(arrs[0]) for arr in arrs))
@requires(all(ax == axs[0] for i, axs in enumerate(zip(*arrs)) for ax in axs if i != K))
@ensures(all(out_ax == axs[0] if i != K else out_ax == sum(axs) for i, axs in enumerate(zip(*arrs))))
def concatenate(arrs: List[Shape],
                axis: K) -> Shape: ...
```

Again, this has its host of problems, since it refers to unbound variables and
the specification of the output type is still hazy, but it still captures the
ideas behind the `@requires` and `@ensures` clauses.
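
To make the intent of these decorators concrete, here is a minimal runtime
sketch of what a `@requires`-style contract could check. This is purely
illustrative: the actual proposal is about static checking, and `requires` and
`concat_checked` below are hypothetical names, not part of `mypy` or `numpy`.

```python
import functools
import numpy as np

def requires(predicate, message="precondition failed"):
    # Hypothetical contract decorator: evaluate a predicate over the call's
    # arguments before invoking the function (runtime sketch only).
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            assert predicate(*args, **kwargs), message
            return func(*args, **kwargs)
        return wrapper
    return decorator

@requires(lambda arrs, axis=0: all(len(A.shape) == len(arrs[0].shape) for A in arrs),
          "all arrays must have the same number of axes")
def concat_checked(arrs, axis=0):
    return np.concatenate(arrs, axis=axis)

print(concat_checked([np.ones((2, 5)), np.ones((3, 5))]).shape)  # (5, 5)
```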
Ivan was kind enough to reply to my writeup from above and offer feedback.
First, he noted as an aside that the concern about writing thousands of stubs
for `np.mean` was mostly unfounded, since people rarely use more than 7 or so
axes. This is partly a matter of practicality, and partly because "the curse
of dimensionality" makes a greater number of axes infeasible anyway. He also
cited the fact that the current type stubs for the standard library's `map`
and `zip` only have 7 overloads each, and no one in all of `mypy`'s usage has
ever complained.
He did agree with the need for a `Drop[Ts, K]` operator and even a "double
variadic" form, `Drop[Ts, Ks]`.
In Ivan's industry experience with static type checkers, having literally met
with the data science teams that would use `mypy` and `numpy` type-checking,
it turns out that people don't want to write very complicated function
signatures most of the time. 90% of the time, they can write very simple
signatures with the primitives that have already been defined.
For the very complicated functions (the other 10% of the time), `mypy`
supports a plugin system where people can write very precise type-checking for
certain kinds of functions. This includes functions like `concatenate`, which
I noted in my writeup were very difficult to express.
Overall, Ivan noted two things:

- "Practicality beats purity" - in Python and `mypy`, sometimes shortcuts are
  taken so that the tool is actually useful for the average person, without
  getting too bogged down in theory and syntax.
- It's better to make progress in small steps and then iterate to refine the
  product. Once a minimum working product has been completed, it can then be
  adjusted, refined, and expanded to fit more situations.
In the end, Ivan would like me to start working on a minimal implementation in
`mypy` as soon as possible, which I intend to do over winter break and through
next semester.
Albeit small, I made a contribution to `mypy` during the course of the
semester. This was my first chance at downloading the codebase, interacting
with parts of it, running and writing tests, and getting code reviewed by the
maintainers. You can find the merged PR here.

Ultimately, I would like to make a major contribution to `mypy` by
implementing these `numpy` typings, even in their most basic form, so that I
can lay the groundwork for more serious typing work for Python in the future.
To ramp up, I've been looking at various bug reports and feature requests and taking care of them incrementally.
I've arranged a conference call with the core `mypy` and `pyre` (FB's typing
library) maintainers to discuss a Python Enhancement Proposal (PEP) regarding
the `numpy` types.

Ivan has already laid out a path to the `numpy` typing, which is roughly:

- General support for "variadicity" and `ArgSpecs`
- Support for array-arity declaration/checks
- Support for fixed shape types (integer generics)
- Support for algebra of dimensions
I will work more closely with him and communicate more frequently to get guidance on the specific implementation details, and I hope to accomplish at least everything he has laid out by the end of the next semester.
In addition to contributing to the main `mypy`, I am also interested in
embedding these matrix operations into other, more fully featured languages to
investigate the practicality of the typing primitives. I discuss this further
in the next section.
Liquid Haskell is a refinement type system built on top of Haskell. This
system allows logical predicates to be embedded in the types of functions and
arguments, which almost exactly corresponds to the `@requires` and `@ensures`
clauses from earlier.

As an exploration into the feasibility, usability, and usefulness of full
refinement types for matrix operations, I propose implementing a fully
featured matrix library in Liquid Haskell or an equivalently strong type
system, to see if the tradeoffs are worth bringing over to Python.