C integer types: the missing manual
Here be dragons.
Throughout the scikit there are bits and pieces written in Cython, and these commonly use C integers to index into arrays. The Python code also regularly creates arrays of C integers, by `np.where` and other means. Confusion arises every so often over what the correct integer type is, especially when we use integers as indices.
There are various C integer types; the main ones, for our purposes, are the following (a quick way to check their sizes on your own platform is sketched after the list):
- `int`: the "native" integer type. This once meant the size of a register, but on x86-64 this is no longer true, as `int` is 32 bits wide while the registers and pointers are now 64 bits.
- `size_t`: standard C89 type, defined in `<stddef.h>` and a variety of other standard C headers. Always unsigned. Large enough to hold the size of any object, i.e. 64 bits on a 64-bit machine, 32 bits otherwise. This is the type of a C `sizeof` expression and of the return value of `strlen`, and it's what functions like `malloc`, `memcpy` and `strcpy` expect. Use when dealing with these functions.
- `Py_ssize_t`: type defined in `<Python.h>` and declared implicitly in Cython; it can hold the size (in bytes) of the largest object the Python interpreter ever creates. Index type for `list`. 63 bits + sign on x86-64; in general, the signed counterpart of `size_t`, with the sign used for negative indices so that `l[-1]` works in C as well. Use when dealing with the CPython API.
- `np.npy_intp`: type defined by the NumPy Cython module that is always large enough to hold the value of a pointer, like `intptr_t` in C99. 63 bits + sign on x86-64, and probably always the same size as `Py_ssize_t`, although there's no guarantee. Use for indices into NumPy arrays; the NumPy C API expects this type.
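
For reference, you can check how these types come out on a given platform from Python (a quick sketch using NumPy, `ctypes` and `sys`; the values in the comments are what you'd see on x86-64 Linux):

```python
import ctypes
import sys

import numpy as np

# np.intc corresponds to a C "int", np.intp to npy_intp.
print("int:       ", 8 * np.dtype(np.intc).itemsize, "bits")      # 32
print("size_t:    ", 8 * ctypes.sizeof(ctypes.c_size_t), "bits")  # 64
# sys.maxsize is the largest Py_ssize_t value, 2**63 - 1 on x86-64.
print("Py_ssize_t:", sys.maxsize.bit_length() + 1, "bits")        # 64
print("npy_intp:  ", 8 * np.dtype(np.intp).itemsize, "bits")      # 64
```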
Now to confound matters:
- BLAS uses `int` for all its integers, except (in ATLAS) the return value of certain functions, which is `size_t`. It follows that when you call BLAS, you shouldn't expect to be able to handle array dimensions larger than `INT_MAX` = 2³¹ − 1 (defined in `<limits.h>`). If this doesn't sound like much of a problem, consider the fast ways to compute the Frobenius norm of a matrix:

  ```python
  scipy.linalg.norm(X)   # or sqrt(np.dot(X.ravel(), X.ravel()))
  ```

  The ravel'd array, implicit in the call to `norm`, has only one dimension, which may be ≥ 2³¹. This is no problem for NumPy's array data structure, but `norm` may call `cblas_nrm2`, and that can't handle such an array size correctly (`dot` has been fixed). Most likely, it will process only part of the array, but this depends on the implementation.
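
  If you need to guard against this, one possible pattern (a sketch only; `safe_frobenius_norm` and the chunk size are made up for illustration, this is not code from the scikit) is to check the ravelled size against `INT_MAX` and fall back to a chunked accumulation, so that no single BLAS call ever sees more than `INT_MAX` elements:

  ```python
  import numpy as np
  from scipy import linalg

  INT_MAX = np.iinfo(np.intc).max   # 2**31 - 1 on typical platforms

  def safe_frobenius_norm(X):
      x = np.asarray(X, dtype=np.float64).ravel()
      if x.size <= INT_MAX:
          return linalg.norm(x)     # dimensions fit in a C int, BLAS is fine
      # Accumulate the squared norm in chunks small enough for 32-bit BLAS.
      total = 0.0
      chunk = 2 ** 24
      for start in range(0, x.size, chunk):
          block = x[start:start + chunk]
          total += np.dot(block, block)
      return np.sqrt(total)
  ```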
- `scipy.sparse` uses index arrays of type `int` to represent matrices in COO, CSC and CSR formats, so it has much the same limitation as BLAS: `n_samples`, `n_features` and the number of non-zero entries are all limited to 2³¹ − 1. SciPy 0.14 has 64-bit indices as well; we'll probably need to use fused types in all the sparse matrix-handling Cython code to properly support these.
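
  To see which index type a particular sparse matrix is using, you can inspect its `indices` and `indptr` arrays (a small illustration for the CSR format; the exact dtype depends on your SciPy version and on the matrix size):

  ```python
  import numpy as np
  import scipy.sparse as sp

  # CSR stores its structure in the `indices` and `indptr` arrays;
  # with 32-bit indices, nnz and both dimensions are capped at 2**31 - 1.
  A = sp.csr_matrix(np.eye(3))
  print(A.indices.dtype, A.indptr.dtype)   # typically int32
  print(np.iinfo(np.int32).max)            # 2147483647
  ```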
- Since `npy_intp` is an alias at the C level, NumPy has no way of showing that a variable is of this type in Python. Instead, it shows the actual type, so on x86-64 (but not on i386, and probably not on ARM), you'll get results like

  ```python
  >>> type(np.intp(1))     # corresponds to npy_intp
  <type 'numpy.int64'>
  >>> type(np.intc(1))     # corresponds to a C "int"
  <type 'numpy.int32'>
  >>> np.where([True])[0].dtype
  dtype('int64')           # actually an npy_intp
  ```
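
  One practical consequence: when checking whether an index array matches what Cython code typed with `npy_intp` expects, compare against `np.intp` rather than against a hard-coded `int32`/`int64` (a small sketch, not code from the scikit):

  ```python
  import numpy as np

  idx = np.where([True, False, True])[0]
  assert idx.dtype == np.intp   # portable; np.int64 only holds on x86-64

  # Coerce indices coming from elsewhere to the platform's npy_intp:
  idx = np.asarray([0, 2, 5], dtype=np.intp)
  ```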
- `np.random.randint` returns a Python `int` (variable-size integer) when asked for one number. When asked for an array, it returns either 32-bit or 64-bit integers depending on `sizeof(long)`; this is hardcoded in the C implementation. On most platforms this matches the size of `npy_intp`, but again there's no guarantee, so getting random indices can be tricky.
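
  So when the random numbers are destined to be used as indices, it's safest to cast explicitly (a minimal sketch):

  ```python
  import numpy as np

  # Cast explicitly: on platforms where a C long is 32 bits (e.g. 64-bit
  # Windows), randint's array output would otherwise not match npy_intp.
  idx = np.random.randint(0, 100, size=10).astype(np.intp)
  print(idx.dtype)   # int64 on x86-64
  ```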