- Feature Name: generic_int
- Start Date: 2018-05-15
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
Adds the builtin types `uint<N>` and `int<N>`, allowing integers with an arbitrary size in bits. For now, `N` is restricted to be at most 128.
This feature will ultimately enable bitfields, which let multiple fields in a struct be defined as ranges of bits rather than bytes. Here's an example in C:
```c
struct MipsInstruction {
    int opcode: 6;
    int rs: 5;
    int rt: 5;
    int rd: 5;
    int shift: 5;
    int function: 6;
};
```
In this format, all the data we need (in this case, one particular kind of MIPS instruction) is stored within 32 bits, but we don't have any particular way to map these to fields. In today's Rust, if we wanted to represent this struct, we'd have to create methods to grab the data for each field using bit shifts and masking. Here's an example of what this looks like for the `rs` field:
```rust
fn get_rs(inst: u32) -> u32 {
    // Shift the field down to bit zero, then mask off everything above it.
    (inst >> 21) & !(!0 << 5)
}

fn set_rs(inst: u32, rs: u32) -> u32 {
    // Clear the old field, then OR in the new (masked) value.
    inst & !(!(!0 << 5) << 21) | ((rs & !(!0 << 5)) << 21)
}
```
As you can see, getting the shift (`21`) and mask (`!(!0 << 5)`) is not as obvious as you'd think. The shift is actually the sum of the widths of all of the fields after `rs` (`5 + 5 + 5 + 6 == 21`), and the mask is actually `0b11111`, where the number of ones corresponds to the size of the field. It's very easy for a human to mess these up, whereas in this case, C does all of the work for you.
This RFC actually doesn't solve these problems, but solves a subtler problem that C doesn't even try to solve: what is the type of the `rs` field itself? If we were to write out this hypothetical struct in Rust, we'd want something like:
```rust
#[repr(bitpacked)]
struct MipsInstruction {
    opcode: u6,
    rs: u5,
    rt: u5,
    rd: u5,
    shift: u5,
    function: u6,
}
```
Unfortunately, `u5` and `u6` aren't valid types in Rust, and C's way of doing `u8: 5` sounds even less desirable. Luckily, with const generics, we have a better solution: `uint<N>`:
```rust
#[repr(bitpacked)]
struct MipsInstruction {
    opcode: uint<6>,
    rs: uint<5>,
    rt: uint<5>,
    rd: uint<5>,
    shift: uint<5>,
    function: uint<6>,
}
```
Additionally, methods can take or return `uint<5>` and `uint<6>` to statically guarantee that a value does not have more bits than necessary. If a method to set `opcode`, for example, took a `u8`, it's not clear whether the method should simply ignore the upper two bits or panic if they're non-zero.
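For illustration, here's a sketch of how such a setter could look; the `set_opcode` method and the `#[repr(bitpacked)]` layout it relies on are hypothetical:

```rust
impl MipsInstruction {
    // Accepting uint<6> makes passing a wider value a *type* error,
    // so no runtime masking-vs-panicking policy has to be chosen here.
    fn set_opcode(&mut self, opcode: uint<6>) {
        self.opcode = opcode;
    }
}
```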
As stated before, `#[repr(bitpacked)]` is not proposed by this RFC and is not necessarily the way this will work. However, `uint<N>` is what this RFC proposes.
Right now, there's a lot of boilerplate for implementing methods for integer primitives. The standard library itself is a great example; almost the entirety of `core::num` uses some gnarly macros to define all sorts of traits for all the integer types. One example is `Shl` and `Shr`, which are defined not just for every integer type, but for every combination of integer types. We could easily do this with const generics instead:
```rust
impl<const N: usize, const M: usize> Shl<uint<M>> for uint<N> {
    type Output = uint<N>;

    #[inline]
    fn shl(self, rhs: uint<M>) -> uint<N> {
        // implementation
    }
}
```
This will probably decrease compile times as well, as the compiler simply needs to understand one generic impl instead of 144 macro-generated impls. Because `Shl` and `Shr` can also handle references, that multiplies the generated impl count by four, meaning that our macros generate 576 impls for each of these traits!
Using a smaller number of bits for a value also has the benefit of niche enum optimisations. For example, `uint<7>` represents a single ASCII character, and `Option<uint<7>>` can be stored in a single byte. Additionally, `Result<uint<7>, E>` also takes one byte if `E` is zero-sized.
Having generic impls would drastically reduce the noise in the "implementations" section of rustdoc. For example, the number of implementations of `Add` for integer types really drowns out the fact that it's also implemented for strings and `std::time` types, which is useful to know too.
Up until now, we've only seen specific integer types, like `u8` or `i32`. What if we want to write a trait that works with all of them?
Let's go with a simple example: say we want to write our own version of the `Default` trait which returns one instead of zero. We'll define it like this:
```rust
pub trait MyDefault: Sized {
    fn my_default() -> Self;
}
```
Right now, with what we know, we have to copy-paste our impl for every integer type. It would look something like:
```rust
impl MyDefault for u8 {
    fn my_default() -> u8 { 1 }
}
```
Except, we'd have to replicate this for every single integer type. If we're clever, we could use a macro:
```rust
macro_rules! impl_my_default {
    ($($int: ident),*) => {$(
        impl MyDefault for $int {
            fn my_default() -> $int { 1 }
        }
    )*}
}

impl_my_default!(i8, u8, i16, u16, i32, u32, i64, u64, i128, u128, isize, usize);
```
But really, that's just making the compiler do the copy-pasting for us. Instead, we're going to use the special types `uint<N>` and `int<N>` to generalise the code for us. For this, we're going to use const generics instead of type generics. The end result looks like:
```rust
impl<const N: usize> MyDefault for uint<N> {
    fn my_default() -> uint<N> { 1 }
}
impl<const N: usize> MyDefault for int<N> {
    fn my_default() -> int<N> { 1 }
}
impl MyDefault for usize {
    fn my_default() -> usize { 1 }
}
impl MyDefault for isize {
    fn my_default() -> isize { 1 }
}
```
That's a lot better! As you'll notice, we still have to implement `usize` and `isize` separately; that's because they're separate types altogether. For example, on 64-bit systems, `usize` has the same representation as a `u64`, but `uint<64>` refers to `u64`, not `usize`. It's unfortunate, but we've reduced our number of impls from twelve to four, which is still a very big deal!
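To see the payoff, here's a hypothetical use of the generic impls; the literal comparisons assume the coercion rules described later in this RFC:

```rust
// One generic impl now covers every width, including odd ones:
let x: uint<24> = MyDefault::my_default();
let y: int<48> = MyDefault::my_default();
assert_eq!(x, 1);
assert_eq!(y, 1);
```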
Note that there's one slight caveat here, which is that our `my_default` method might overflow. This seems silly, but there's one type, `int<1>`, which can't hold the value 1 (its only values are -1 and 0), and because of this, the compiler will emit an `overflowing_literals` lint. In general, if you're casting a literal to a generic integer, you can't expect any value other than zero to work. In the future, we'll be able to annotate our `int<N>` impl with something like `where N > 1`, but until then, we'll have to deal with this weird overflow behaviour.
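For reference, such a bounded impl might eventually look like this; the value-level `where N > 1` predicate is hypothetical syntax that const generics do not yet support:

```rust
impl<const N: usize> MyDefault for int<N>
where
    N > 1, // hypothetical: excludes int<1>, whose range is only -1..=0
{
    fn my_default() -> int<N> { 1 }
}
```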
One other side effect of having `uint<N>` and `int<N>` is that we can represent a lot more types than before. For example, `uint<7>` is just a seven-bit integer, which we might use to represent an ASCII character. Don't be fooled by these; we can't actually make `N` bigger than 128 (for now), and `uint<N>` is almost always stored as `uint<N.next_power_of_two()>`. `uint<7>` is stored as `u8`, `uint<10>` is stored as `u16`, `uint<48>` is stored as `u64`, etc. If you store a value with a nonstandard number of bits, you're not actually saving any space, so keep that in mind!
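As a quick illustration of that storage rule, assuming `size_of` reports the padded sizes as described:

```rust
use std::mem::size_of;

assert_eq!(size_of::<uint<7>>(), 1);  // stored as u8
assert_eq!(size_of::<uint<10>>(), 2); // stored as u16
assert_eq!(size_of::<uint<48>>(), 8); // stored as u64
```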
That said, there are a few cases where we might want to use one of these weirdly sized types, as many data formats store values in terms of bits instead of bytes to save space. Eventually, we'll have a way of representing these, but not today.
The compiler will gain the built-in integer types `uint<N>` and `int<N>`, where `const N: usize`. These will alias to the existing `uN` and `iN` types if `N` is a power of two and no greater than 128. `usize`, `isize`, and `bool` remain separate types because of coherence issues, i.e. they can have separate implementations and additional restrictions applied.
`int<N>` and `uint<N>` will have the smallest power-of-two size and alignment that fit `N` bits. For example, this means that `uint<48>` will take up 8 bytes and have an alignment of 8, even though it only has 6 bytes of data.
`int<N>` stores values between `-2^(N-1)` and `2^(N-1) - 1`, and `uint<N>` stores values between `0` and `2^N - 1`. It's valid to cast `int<N>` or `uint<N>` to `int<M>` or `uint<M>` via `as`, and sign extension will occur as expected. This conversion is lossless when converting `int<N>` to `int<M>` or `uint<N>` to `uint<M>` where `M >= N`, or when converting `uint<N>` to `int<M>` where `M > N`.
In addition to the usual casts, `uint<1>` can also be cast to `bool` via `as`, whereas all other integer types can only be cast from `bool`. This does not apply to `int<1>`, as the value `true` is equivalent to 1, and `int<1>` does not include the value 1. `int<1>` will still allow casts from `bool`, which will overflow to -1 as expected.
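Concretely, assuming these casting rules:

```rust
let x: uint<1> = 1;
let b = x as bool;      // true: only uint<1> may be cast *to* bool
let y = true as int<1>; // -1: the value 1 overflows int<1>'s range of -1..=0
```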
For the moment, a monomorphisation error will occur if `N > 128`, to minimise implementation burden. This error can be lifted in the future, and if any future `uN` or `iN` types are added, these should alias with the corresponding `int<N>` and `uint<N>` types. The main justification for this change is to allow the compiler to continue to represent literals and other values using `u128` internally, and to allow stabilising this feature before larger ints are allowed internally.
`uint<0>` and `int<0>` are allowed as constant zero types. While this is not inconsistent for `uint<0>`, as zero fits in the range `[0, 2^N - 1]` when `N = 0`, it is somewhat inconsistent for `int<0>`, as plugging `N = 0` into the bounds gives a range of -½ to -½, which doesn't even contain zero. We decided to make both of these constant zero types, but not identical types, as the invariant `<uint<N>>::default() == <int<N>>::default() == 0` is desired.
Integer literals can be automatically coerced to `uint<N>` and `int<N>`, although generic `iN` and `uN` suffixes are left to future RFCs. Because of the restriction that `N <= 128`, compiler code can continue to store literals as `u128`. When coercing literals to `uint<N>` or `int<N>`, bounds checks should be done just like they are with regular integer literals.
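For example, with the usual out-of-range lint behaviour:

```rust
let a: uint<5> = 31; // ok: 31 fits in 5 bits
let b: uint<5> = 32; // overflowing_literals lint, like `let x: u8 = 256;`
```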
Primitive operations on `int<N>` and `uint<N>` should work exactly like they do on the existing `iN` and `uN` types: overflows should panic when debug assertions are enabled, but wrap when they are not. In general, `uint<N>` will be zero-extended to the next power of two, and `int<N>` will be sign-extended to the next power of two.
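Wrapping here means truncating the result back to `N` bits, as in this hypothetical example (assuming `wrapping_add` exists on `uint<N>` as it does on `u8` today):

```rust
let x: uint<4> = 15;
let y = x.wrapping_add(1); // 0: the 5-bit result 16 is truncated to 4 bits
```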
Right now, many methods on the integers are implemented using intrinsics that only work on the existing integer types, which are natively supported by LLVM; this will not change. Instead, methods that require these intrinsics will be implemented using specialisation, i.e. adding separate `default impl` versions which will ultimately call the intrinsic-based versions.
This way, instructions which work on the native sizes will be emitted for those sizes, and non-native sizes will "emulate" the method using the native version. For example, take the `count_zeros` function:
```rust
default impl<const N: usize> uint<N> {
    fn count_zeros(self) -> u32 {
        // Count at the next power-of-two width, then discard the padding bits.
        const M: usize = N.next_power_of_two();
        let zeros = (self as uint<M>).count_zeros();
        zeros - (M - N) as u32
    }
}
// impl u8, etc.
```
Let's look at the specific case for `uint<7>`:
```rust
fn count_zeros(self) -> u32 {
    const M: usize = 7usize.next_power_of_two(); // M == 8
    let zeros = (self as uint<M>).count_zeros();
    zeros - (M - 7) as u32
}
```
Ultimately, this will optimise to:

```rust
fn count_zeros(self) -> u32 {
    (self as u8).count_zeros() - 1
}
```
This way, we can implement all of the expected integer functions without having to mess with the underlying intrinsics.
Right now, Rust optimises enums for types which have unrepresentable values. For example, `char` is basically a `u32` under the hood, but the values in the range `0xD800..0xE000` aren't allowed. The compiler can exploit this, allowing `Option<char>` to be the same size as `char`; the space of invalid values (the niche) can be used to store enum discriminants.
As a hard rule, `uint<N>` will always have its upper `N.next_power_of_two() - N` bits set to zero, and similarly, `int<N>`'s upper bits will be sign-extended. To compute niche optimisations for enums, we'll have to take this into account.
Niche values for `uint<N>` are simply any nonzero patterns in the upper bits, similarly to `bool`. For example, because `uint<7>` ranges from 0 to 127, the values from 128 to 255 are free to use; this allows us to represent 128 extra unit variants in those values. Imagine a weirder version of `Option<uint<7>>` where there are 128 different versions of `None` -- this would be the same size as a `uint<7>`.
Niche values for `int<N>` are trickier. Because of sign extension, the extra bits can be either zero or one. However, the sign is "extended," which we can exploit -- as long as at least one of the upper bits is different from the sign bit, we're okay. This gives us exactly the same number of values as we'd otherwise get from `uint<N>`, but we'll have to warp our brains a bit to understand them.
Below, I'll only draw the upper, "middle," and lower bits, so that we can convince ourselves that this applies to any number of bits. The middle bit will be the sign bit at bit `N - 1`, and we'll also show the bits around it. Let's start counting, from zero:
```text
00...000...00 (0)
00...000...01 (1)
00...000...10 (2)
...
00...001...11 (2.pow(N - 1) - 1)
```
Note that the next value sets our sign bit to one, but leaves the bits above the sign at zero. This starts the niche space, where the sign bit isn't the same as all the bits above it:
```text
00...010...00 (2.pow(N - 1) + 0)
00...010...01 (2.pow(N - 1) + 1)
00...010...10 (2.pow(N - 1) + 2)
...
00...011...11 (2.pow(N) - 1)
```
Note that now, once we add one, our sign bit will flip back to zero. However, even as we keep counting, the sign won't be extended through all the upper bits. So, really, our niche extends quite far, until...
```text
00...100...00 (2.pow(N) + 0)
00...100...01 (2.pow(N) + 1)
00...100...10 (2.pow(N) + 2)
...
11...101...11 (-2.pow(N - 1) - 1)
```
Finally, at this point, adding one will flip our sign bit to be negative, and it will finally be extended all the way. After this, our ordinary values resume, starting at `-2.pow(N - 1)` and ending at `-1`.
As you can see, we have exactly the same number of values to use as niche, and they're all in one contiguous range.
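To make this concrete, here's the full layout for `int<4>` stored in a `u8` (sign bit at bit 3):

```text
valid: 0000_0000 ..= 0000_0111  ( 0 ..=  7)
niche: 0000_1000 ..= 1111_0111  (240 contiguous patterns)
valid: 1111_1000 ..= 1111_1111  (-8 ..= -1)
```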
Because sign extension will always be applied, it's safe for the compiler to internally treat `uint<N>` as `uint<N.next_power_of_two()>` when doing computations. As a concrete example, this means that adding two `uint<48>` values will work exactly like adding two `u64` values, generating exactly the same code on targets with 64-bit registers. Right?
Actually, it turns out that things will be slightly more complicated. In debug mode, the generated code will be very similar; overflow checks will still panic as usual. However, in release mode, as we saw in the above section, there will have to be additional code to make sure that overflows and underflows extend the sign correctly.
This is actually a good case for adding `add_unchecked` (and similar) methods which require the caller to ensure that overflow does not occur, skipping the extra code in release builds. Such methods would require their own RFC, but this actually does provide a decent reason why they'd be useful.
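Such a method might look like the following sketch; the name and signature are hypothetical, pending that future RFC:

```rust
impl<const N: usize> uint<N> {
    /// # Safety
    /// The caller must guarantee that `self + rhs` fits in `N` bits;
    /// in exchange, release builds can skip re-extending the upper bits.
    unsafe fn add_unchecked(self, rhs: uint<N>) -> uint<N> {
        // body omitted; this would lower to a native-width add
        unimplemented!()
    }
}
```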
Unfortunately, these generalisations aren't going to be free in all cases.
The restriction that `N <= 128` may allow for future changes where `uint<N>` is not actually stored as `uint<N.next_power_of_two()>` for sufficiently large `N`. Right now, even storing `uint<65>` as `u128` makes the most sense, because it will be the most efficient representation on 64-bit CPUs for performing many operations. However, storing `uint<4097>` as `uint<8192>`, for example, may be total overkill.
The main goal for these representations is to avoid bit-wrangling as much as possible. Although truncating the number of bits for integers is no longer free, e.g. `u64` to `uint<48>` requires an AND and `i64` to `int<48>` requires conditionally executing an AND or an OR, the goal is that in most cases, using non-power-of-two lengths should not incur too much of a performance penalty. Additionally, the compiler should be able to optimise the sign extension to happen at the end of a series of operations, rather than after every operation.
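Roughly, those truncating casts would lower to something like this, hand-written here for illustration:

```rust
fn u64_to_uint48(x: u64) -> u64 {
    x & ((1 << 48) - 1) // a single AND clears the top 16 bits
}

fn i64_to_int48(x: i64) -> i64 {
    if x & (1 << 47) != 0 {
        x | !((1 << 48) - 1) // negative: OR the sign into the top 16 bits
    } else {
        x & ((1 << 48) - 1)  // non-negative: AND them to zero
    }
}
```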
For now, additional `primitive.uint` and `primitive.int` pages will be added to rustdoc, and the existing `uN` and `iN` pages will be left as-is. Eventually, if we're comfortable with it, we can remove the `uN` and `iN` pages entirely and use `primitive.int` and `primitive.uint` as the single source of documentation for these types, plus the pages for `usize` and `isize`.
There certainly is a precedent for this: as of right now, all of these pages share the same documentation, and the examples are modified to work for all of these types. Removing these separate pages would help remove documentation redundancy, although `usize` and `isize` would still have to be kept separate.
The most obvious extension is to allow arbitrarily sized integer types. One clear use is cryptographic code, which often works with fixed-but-large integers and would no longer have to define its own `u256` or `u4096` types. Additionally, generalisation over integers could allow code that works with e.g. 1024-, 2048-, and 4096-bit keys to have its arithmetic monomorphised and specialised for free, without writing extra code.
Another benefit of this would be allowing arbitrary-length integer literals, which would be useful for bignum libraries. Instead of parsing a string or byte array to create a bignum, the bignum library could simply have:

```rust
fn new<const N: usize>(num: uint<N>) -> BigNum {
    // ...
}
```
Internally, the compiler could represent the power-of-two `uN` and `iN` as a `doubled_uint<N / 2>` and `doubled_int<N / 2>`, which could even be implemented in the standard library. At that point, only a lang item would be necessary for allowing literal coercion.
Of course, at this point, how this extension would actually work is complete speculation. It seems like an inevitable follow-up to this RFC, although, due to the large number of unsolved problems, it will require one or several more RFCs to stabilise.
In the future, types could be sized in terms of bits instead of bytes, with `(bit_size_of::<T>() + 7) / 8 == size_of::<T>()`. All types would have a bit size, allowing for a future `repr(bitpacked)` extension which packs all values in a struct or enum variant into the smallest number of bytes possible, given their bit sizes. Doing so would prevent referencing the fields of the struct, although setting and getting fields would still be possible.
For example, here's a modified version of our previous example:

```rust
#[repr(C, bitpacked)]
struct MipsInstruction {
    opcode: uint<6>,
    rs: uint<5>,
    rt: uint<5>,
    immediate: u16,
}
```
We could allow referencing `&self.immediate`, as it is aligned to a byte boundary, although all of the other fields could only be copied (e.g. `self.rs`) or set explicitly (e.g. `self.rt = 4`). Because of this restriction, we would only allow `!Drop` types in bit-packed structs, as `Drop` requires a mutable reference. We would probably initially want to limit this to `Copy` types, but I haven't thought much on why expanding to all `!Drop` types wouldn't be doable.
All of the ASCII-specific methods for `u8` could be copied to `uint<7>`, where all values are valid ASCII. Longer-term, the potential for using `uint<7>` for ASCII-specific matching and splitting in `str` could be discussed. Additionally, casting from `[uint<7>]` to `str` could be made safe, as all ASCII strings are valid UTF-8. This could make code which interacts with ASCII much, much safer, without needing to use `from_utf8_unchecked` or pay the runtime cost of `from_utf8`.
This is a big change to the language, not to be taken lightly. Additionally, many libraries may start using `uint<N>` for non-power-of-two `N` because it reduces the number of invalid values, even if it's not the best solution. For example, a library taking a percentage from 0 to 100 might use a `uint<7>`, even though it really should just take a `u8` or `u32`.
This requires const generics, which have not even been implemented in the compiler at the time of writing. This would be the first builtin type which uses generics at all, too.
No other language provides this kind of generalisation, and it may be more effort on the compiler side than it's worth. For example, the existing 576 impls for `Shl` might just be better and more performant than four new generic ones.
An alternative naming scheme, `u<N>` and `i<N>`, would seem closer to the existing types, but has a potential conflict: `u` and `i`, two names which could easily show up in a program, may shadow the builtin types. Considering how often the variable `i` is used for loops, this seems like a no-go.
Generalising integers over their number of bits is certainly a very logical way to generalise integers for computers, but generalising based upon bounds is a very natural way for humans to do it, and more general. For example, instead of `uint<N>` and `int<N>` types, we could get away with just one type, `int<A..=B>`. This would be more powerful than the original: for example, a percentage could be represented exactly as `int<0..=100>`. Whether an integer is signed simply depends on whether its lower bound is negative.
The primary reason for leaving this out is… well, it's a lot different from the existing integer types in the language. Additionally, such types could be added backwards-compatibly by guaranteeing that `int<0..=2.pow(N)-1>` is equivalent to `uint<N>`, and `int<-2.pow(N-1)..=2.pow(N-1)-1>` is equivalent to `int<N>`. As such, these types could be added as a future extension.
Quite simply, there are more questions to answer by adding range-based types than size-based types, and the goal of this RFC is to be able to reap the benefits of generic integers before such designs are fleshed out.
Previously, Rust had traits which generalised the integer types and their methods, but these were ultimately removed in favour of inherent methods.
Going with a generic `uint<N>` over an `Int` trait would avoid the problem of determining which methods are suited for these traits; instead, the `uint<N>` type would have all of them. Additionally, having these traits does not allow non-power-of-two `uint<N>` outright, and so, this feature is a strict superset of that.
Having separate `uint<N>` and `int<N>` types also solves the problem of generalising signed and unsigned integers: everything would require one impl for signed integers and one for unsigned. This would ensure that these types have exactly the same behaviour for all `N`, only differing in the upper bound on the number of bits.
Many proposals for bitfields involved using macros to create structs, which would ultimately allow the underlying mechanisms for bitfield-based structs to be decided by the ecosystem, rather than the language itself.
This is actually orthogonal to this proposal: having language-level `uint<N>` types means that such macros would be able to return exactly the type necessary for each field. While something like `repr(bitpacked)` would be very far off in terms of stabilisation, these macros could simply get and set `uint<N>` values, and users could determine how to work with them without having to worry about checking for overflow or manually implementing code which has already been written. So, as far as this RFC is concerned, macro-based bitfields are complementary, not competing.
Once const generics and specialisation are implemented and stable, almost all of this could be offered as a crate which provides `uint<N>` and `int<N>` types. There are a few downsides to this:

- These library types won't unify with the existing `uN` and `iN` types, meaning that you'll still have to use macros to define impls for all integer types.
- Until compiler support for enum optimisations is made available to external crates, the unused values of these types won't be usable for enum variants.
- If the standard library were to eventually support generic integer literals, then these types would have to be built-in anyway.
- The standard library won't be able to reap any of the benefits of these types, listed in the rest of the RFC.
In the Rust 2018 call for community blog posts, a fair number of Rustaceans expressed a desire for bitfields at the language level, for use in embedded code. So, while having bitfields is non-negotiable, whether this RFC is the right path to them is definitely in question.
If this RFC were not accepted, then the most suitable alternative would be to develop a nursery crate for creating bitfield-based structs in a way that offers robust error messages and reasonable overflow behaviour. However, this would not solve all of the other problems this RFC has mentioned, and it is likely not the ideal path.
At the time of writing, no known programming language offers this level of integer generalisation, and if this RFC were to be accepted, Rust would be the first.
A few other approaches to generalising integers were explored in the alternatives section, like integer traits. A number of languages, notably Haskell and other functional languages, have generalised their integer types over a single trait.
Additionally, Ada offers range types similar to those mentioned in the alternatives. I only noticed this near the completion of this RFC and haven't elaborated much on it.
This RFC has a number of unresolved questions, many of which are probably suited to be solved during the development of this feature, rather than in this RFC. However, here are just a few:
- Should `uN` and `iN` suffixes for integer literals, for arbitrary `N`, be allowed?
- What should the behaviour of `uint<N>` and `int<N>` be when casting to/from floats?
- Should `uint<N>` and `int<N>` be the smallest size possible, or have a power-of-two size?
- Should `uint<0>` and `int<0>` be allowed as constant zeroes?
- Should `uint<1>` and `int<1>` allow casting to `bool`?
- Should a sketch of a design for bitfields be proposed before this RFC is accepted? Would `repr(bitpacked)` even be feasible?