Even more integers but stricter integer operators #7512
I kind of love this proposal. If this gets accepted, I would suggest that the "standard integer ops" should allow >> as a "fast, never checked division for powers of two", and that divide should NEVER optimize division by powers of two at compile time. Possibly also << as a "fast (but checked) multiply for powers of two", noting that on most architectures there is no difference (but this distinction is important in, say, embedded systems). Open question: should bitwise << and >> be allowed to over/underflow, with %>> and %<< being checked?
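A minimal sketch of this distinction in current Zig (the helper names are invented): for unsigned operands, >> is an always-safe power-of-two divide, while a checked power-of-two multiply can already be spelled with the @shlExact builtin.

fn div8(x: u32) u32 {
    return x >> 3; // never traps: same value as x / 8 for unsigned x
}

fn mul8Checked(x: u32) u32 {
    return @shlExact(x, 3); // safety-checked: illegal behaviour if bits are shifted out
}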
PS: @RogierBrussee, some of your asterisks are disappearing on account of GitHub turning them into italics directives.
This is proposing to reverse #159, which has been thoroughly considered already, so it's unlikely to be accepted.
What is the range of m32 and b32? Signedness seems like it's orthogonal to wrapping/two's-complement behavior, so you would need im32, um32, ib32, and ub32. This just seems silly to me. This proposal is suggesting to add a significant set of new rules to the language. To merit that complexity, it needs to prevent footguns or grant significant new power. It's not clear to me that it does that. It also does great damage to ergonomics. Here is an example of a useful routine I wrote recently to increment a value within a group:

const Orientation = enum {
Up,
Right,
Down,
Left,
FlippedUp,
FlippedLeft,
FlippedDown,
FlippedRight,
// before this proposal:
pub fn clockwise(self: Orientation, amt: u2) Orientation {
const int = @enumToInt(self);
// increment bottom 2 bits, keep top bit.
const rotated = ((int +% amt) & 3) | (int & 4);
return @intToEnum(Orientation, rotated);
}
// after this proposal:
pub fn clockwise(self: Orientation, amt: u2) Orientation {
const int = @enumToInt(self);
// increment bottom 2 bits, keep top bit.
const rotated = @intCast(u3, (@intCast(b3, @intCast(m3, int) + amt) & 3)
| (@intCast(b3, int) & 4));
return @intToEnum(Orientation, rotated);
}
};

It's not true in general that an integer is fully one of (adding, modulating, bitwising). No hardware in use these days imposes this restriction. I don't see any footguns that would be avoided by splitting these types.
There is no such thing as a signed or unsigned m32, because modular arithmetic is like computing on a clock: for any a, b: m32, adding 1's to a or subtracting 1's from a both get you to b. Hence there are, by design, no <, <=, >, >= operators defined on m32. In other words, m32 has no range: it represents the integers modulo 2^32. (NOTE: here I use the notation 2^32 == pow(2,32).)

The Orientation example just shows that things get more painful if you insist on going back and forth with the binary representation of an integer. This is of course a useful technique at times, but this proposal insists that you make using the binary representation of a u32 or i32 integer explicit by doing a cast to a b32.

It is also true that hardware does not impose these restrictions: hardware just transforms one bit pattern into another, and in fact what the hardware calls add is modular add. Thus if you start from the hardware, an int is just a bit pattern. C famously started out that way, and also identified integers with addresses. It was then found to be useful to abstract to pointers. One could argue that b32 should be C's arbitrary bit patterns, and allow both the arithmetic and the bitwise operations; arithmetic is, after all, a bit operation as well. However, I thought it would be more useful to have a bit more type safety for things like this:

fn foo(index: u32, flags: b32) void

so YMMV, I can see arguments either way.

Update: Using %+, %- as notation for the bit operation of two's complement arithmetic should give you much the same level of type safety, as modular types should be rare.

Now indeed the rotation example gets a little more painful, but precisely because the wind rose is naturally using modular arithmetic mod 4, one can certainly do better than just casting up and down.
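To illustrate the clock point in today's Zig (a sketch; the proposed m types do not exist yet): with wrapping arithmetic, "adding" can make a value compare smaller, which is why an ordering on m32 would carry no arithmetic meaning.

const std = @import("std");

test "wrapping add defeats ordering" {
    var a: u8 = 250;
    a +%= 10; // a is now 4, i.e. 260 mod 256
    std.debug.assert(a < 250); // a "grew" by 10 yet compares smaller
}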
Separating an int-like type without any operators besides
Here are some practical advantages of this proposal (see also some smaller updates to the original proposal).

For having no bit operations and no modular arithmetic on u8 and i8, consider the following expressions:

Each of these expressions is subtly buggy in a different way. Expression 1 has undefined behaviour for n == 0 (unlike the C version; in Zig it would have to be spelled n & (n -% 1); see the sketch after this comment). Even Kernighan and Ritchie warn against doing this in C, because it is error prone due to precedence rules. Note that in examples 2 to 4 the bit manipulations are completely unnecessary: the compiler can change 8*n + 1 to bit operations as an optimisation just fine.

For having modular types m8, m16, ...

For having a bitfield datatype: this makes the need to mingle arithmetic and bitwise operations very rare, and possibly easier to optimise for modern architectures where these operations map to hardware bit operations.
const FooFlags = enum(b5){ Flag0, Flag1, Flag2, Flag3, Flag4, LowFlags = Flag0 | Flag1 } is a natural way to define an enum that is supposed to be used as flags.

Rogier
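The sketch referenced above, assuming expression 1 is the classic clear-lowest-set-bit idiom:

fn clearLowestSetBit(n: u32) u32 {
    // With plain -, n - 1 is safety-checked undefined behaviour for n == 0;
    // the wrapping n -% 1 makes the idiom total (it returns 0 for n == 0).
    return n & (n -% 1);
}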
@RogierBrussee I do like the bit types. I mentioned something like this in other issues. I was thinking more along the lines of bag-o-bits types where you do not want to be able to do arithmetic operations on them. For instance, taking bytes out of a utf8 stream: those are raw data, not unsigned 8-bit integers. Or dealing with user address pointers while in kernel space (you definitely do not want to allow dereferencing).
@kyle-github Thanks for your reaction. I think utf8 bytes as b8 is reasonable, especially because you should not be comparing them with < (because they are not glyphs). But you would still want to know that those bytes encode text in utf8 encoding, with perhaps something like an @encoding attribute:

const utf8 = b8 @encoding(.Utf8);

and

@TypeOf("ß" @encoding(.Iso8859_1)) == [2]const latin1

But that is a whole other can of worms.
Part of the discussion about the utf8 data pointed out that there are times when you want to have data that is not utf8 compliant. My thought with something like that is to have the concept of a

Today you can sort-of/kind-of fake this by creating a struct with a single field with a

There are multiple ways to solve this. I would prefer some way to take the basic integer types and remove operators from them:
This is not well thought out, so please ignore the syntax. There is a strong desire in the Zig community not to allow any kind of operator overloading, and the above does not do that: it only provides an allow-list of existing operators. You can only take away, not add. String bytes fall out as a possible built-in type because you really, really want people to use them correctly.
@kyle-github But if you have string literals (and I don't think anybody wants to get rid of those), you need to have something in the language to define what kind of character encoding is used in a string literal.
Your @MakebitsType gave me an idea for a concise, programming-flavoured way to describe what I propose. Define:

const Operators = enum {

fn @IntegralType(n_bits: comptime_int, order: Order, ops: []Operators) type

Then define:

// NO well defined order, hence no + or - (because there is no well defined overflow), and no ">>".
b8 = @IntegralType(8, .Unordered,
b16 = ...

// Has well defined signedness, and all the promiscuity you have come to expect from a C type that is a thin abstraction.
// Same but signed. This is what i8, i32, ... is now.
// ABI dependent.

// Has well defined order (unsignedness). Implements an abstract unsigned integer of < 2^n_bit, so
u8 = @IntegralType(8, .Unsigned,

// Implementation of integers modulo 2^n_bit
m8 = @IntegralType(8, .Unordered,

Rogier.
@RogierBrussee, I think on the integers you captured where I was going, though with much more mathematical rigor! I was less interested in the integers themselves than in other bag-o-bits types like addresses, where you can have strange sizes (e.g. 48 bits on some processors) or strange bit-pattern requirements (e.g. x86-64's requirement that all higher bits be all ones or all zeros). I would probably also have an option for how underflow/overflow is dealt with, because on DSPs you often get operations for saturating arithmetic. The things I think are important are:
With carefully chosen settings here, you can make 99% of existing Zig code work without change. That would be important for any adoption and to keep things simple.

For strings, the two problems I was trying to solve were: 1) formatted printing, 2) debugging output. Right now, you do not know if a

The reason for having less of a restriction on such bytes was that embedded platforms often do not use utf8 or Unicode in any form, simply because of memory constraints. Strings are still printable, but the encoding type is more or less implicit due to the platform itself. Sometimes the encoding is due to hardware and not even software (character displays). In cases like that you may need to add byte data that is not strictly printable on its own (perhaps to change colors on the output).

Questions:

Is there a benefit to having all these options for common code? It is not clear that picking a single set of behavior for unsigned and signed integers is not sufficient for the 99% common case. E.g. people who want saturating arithmetic can use special functions on existing integers (perhaps using intrinsics that map to specific CPU instructions on DSP platforms); see the sketch after this comment. E.g. people who are writing operating systems can wrap variously sized integers in structs, construct other types that way, and use struct-specific functions to do things like masking, bit-field extraction etc. to handle addresses from different protection domains.

Zig aims to be simple and explicit. These ideas (which I personally like a lot!) are definitely more explicit, but I am not sure they are simple. If they can be hidden from most programmers behind common definitions, then I think this will conform to the Zen of Zig. If programmers are forced to build their own types all the time, that makes the common case too complicated.

Is there utility for a string byte type? Here I think the answer is a clearer "yes". If you have type information about whether something is printable or not, you can use that in comptime code to determine when to print a byte array/slice. You can drop that into debugging info, and thus the debugger can guess more accurately when something should be printable.

I understand why so many in the Zig community dislike unconstrained operator overloading, but handling variations on integral types is one area where it would be extremely handy to be able to say, "for this type, here is the operation you do to add." Using something like that you could, in pure Zig, define all the integral types, define floating-point types, handle saturating vs. modular arithmetic, have real fractional types, handle addresses in a generic way, handle BCD types (still used in mainframes), etc. Note that Zig does have polymorphic operators today:
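The sketch referenced above: saturating add as a plain function over existing integer types, using std.math.add (which returns error.Overflow); the clamp shown is for unsigned operands.

const std = @import("std");

fn satAdd(comptime T: type, a: T, b: T) T {
    // On overflow, clamp to the largest representable value (unsigned T).
    return std.math.add(T, a, b) catch std.math.maxInt(T);
}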
I agree that []u8 is suboptimal as a string type, because you cannot tell the difference between a byte and a printable byte. Even []c_char would be better, I think, except there is no such thing as c_char in Zig (I guess because of an insistence that an integer type has a definite signedness). However, that still does not tell you which character encoding is used.

I think it is perfectly sane to make Utf8 the default encoding, but to be able to specify a particular encoding, which is what I suggested: using some sort of @encoding() attribute (which flags the whole string!), or encoding the encoding in distinct types like

const utf8_byte = enum(u8){}

or any other way to make distinct integral types of definite size.

I don't know if mentioning C++ is considered civil behaviour here, but in C++ they introduced distinct char, char8_t, char16_t, char32_t, wchar_t types because of the different character encodings over the years, and FWIW, I think they were right on this one.

Questions:
But perhaps the way to go is to just rename the current u --> c_u, i --> c_i, defined as the "promiscuous" types they are now, have something like @IntegralType take a type after all (i.e. fn @IntegralType(base: type, ops: Operations)) and "select" operations from the base type to define u, b, m, c_ushort, ..., and have that imported from "std", thereby making the language smaller. However, conversions between the types are not so simple, and I think just having a few more builtin types is easier.

There should still be value in making new integral types:

// A 32 bit flag type: 0, A, B, ~A, ~B, A|B, ~A|B, A|~B, ~A|~B, ~(A|B), ~(~A|~B), ...

// An 8 bit type with 256 different elements that can only be tested for (in)equality.

// Kernel addresses that can be compared and are representable as an integer >= 2^16 and < 2^48.
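A small sketch, in today's Zig, of the distinct-code-unit idea (the name Utf8Cu is invented): a non-exhaustive enum over u8 is layout-compatible with a byte but supports no arithmetic or ordering, only (in)equality and explicit casts.

const Utf8Cu = enum(u8) { _ };

fn fromByte(b: u8) Utf8Cu {
    return @intToEnum(Utf8Cu, b);
}

fn toByte(c: Utf8Cu) u8 {
    return @enumToInt(c); // explicit escape hatch back to the raw byte
}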
While I like the proposed @IntegralType, I have sketched out a possible version of the idea here:

Excerpt:

const u8 = @Type(TypeInfo.Int(.{
.size = .Bits(8),
.interpretation = .Unsigned,
.operations = .{
.equality = .Mem,
.comparison = .Mem,
.arithmetic = .{ .overflow = .Undefined, .div_by_zero = .Undefined },
.modulo_arithmetic = .Mem,
.bitwise = .Complete, // I think this should be null, for u8, but that is orthogonal
},
}));
@asa-z Thanks a lot for making these ideas concrete, and pointing out that Zig already supports much of it!

There seems to be a lot of confusion around "integer" after years of using C-like languages, which ingrained the notion of an "integer" as a collection of bits in a CPU register, usually but certainly not always encoding an integral number through two's complement. I think that to avoid that confusion it is very important to get the terminology right.

So I think a bit more verbosity is needed (it is not like defining new integral types would be a common thing to do), using mathy words to clearly distinguish the mathematical/conceptual notions from the "C-like language" ones. Perhaps something like the below (just a sketch):
const CharacterEncoding{

const Representation{

const Operations = struct {
};

And with this we have the status quo, promiscuous, C int8_t-like u8, which I would like to call c_u8:

const c_u8 = @type(TypeInfo.Int(.{

So my proposal boils down to:

const u8 = @type(TypeInfo.Int(.{
}));

const m8 = @type(TypeInfo.Int(.{

const b8 = @type(TypeInfo.Int(.{

And just for kicks, let us define UTF and Ascii code units:

const UTF8cu = @type(TypeInfo.Int(.{

const UTF16LEcu = @type(TypeInfo.Int(.{

const Asciicu = @type(TypeInfo.Int(.{

The UTFX examples show that more thought is needed for how these "in C you use an integer" classes can be (bit)cast to each other, and in particular to usize. Problems like byte order do not magically disappear, although explicitness helps avoid confusion.

Rogier.
Between operator overloading and the invention of square wheels, I choose operator overloading.
So, I've been thinking, and I think we need to find a much simpler definition, even if it has less flexibility. (Though you should be able to make anything in userspace on top of binary numbers.) Here's what I came up with:

const Int = struct {
    size: Size,
    endian: Endian = .Native,
    edge: Edge, // perhaps should be named `safety`?
    interpretation: Interpretation,

    const Edge = enum {
        Undefined, // safety-checked UB
        ErrorUnion, // return an error union
        Panic, // panic, even in ReleaseFast
        _,
    };

    const Interpretation = union(enum) {
        Arithmetic: struct {
            signedness: enum {
                Signed, Unsigned, Positive, SignedOnesComplement, _,
            },
            kind: enum {
                Modulo, Bounded,
            },
        },
        Binary: struct {
            bitwise_defined: bool,
            shift: enum {
                Arithmetic, Binary,
            },
        },
        Token, // or Cases, or Enum, for things like enums and code units
    };
};

However, there is one big problem: this does not allow you to natively represent C-style integers. I have two solutions to this, but I am not particularly fond of either.

Solution 1: Add a C interpretation:

const Int = struct {
    size: Size,
    endian: Endian = .Native,
    edge: Edge, // perhaps should be named `safety`?
    interpretation: Interpretation,

    const Edge = enum {
        Undefined, // safety-checked UB
        ErrorUnion, // return an error union
        Panic, // panic, even in ReleaseFast
        _,
    };

    const Interpretation = union(enum) {
        Arithmetic: struct {
            signedness: enum {
                Signed, Unsigned, Positive, SignedOnesComplement, _,
            },
            kind: enum {
                Modulo, Bounded,
            },
        },
        Binary: struct {
            bitwise_defined: bool,
            shift: enum {
                Arithmetic, Binary,
            },
        },
        Token, // or Cases, or Enum, for things like enums and code units
        C: struct {
            signedness: enum {
                Signed, Unsigned,
            },
        },
    };
};

Solution 2: Make the interpretations not mutually exclusive:

const Int = struct {
    size: Size,
    endian: Endian = .Native,
    edge: Edge, // perhaps should be named `safety`?
    arithmetic: ?Arithmetic,
    binary: ?Binary,
    token: ?void, // or void?

    const Edge = enum {
        Undefined, // safety-checked UB
        ErrorUnion, // return an error union
        Panic, // panic, even in ReleaseFast
        _,
    };

    const Arithmetic = struct {
        signedness: enum {
            Signed, Unsigned, Positive, SignedOnesComplement, _,
        },
        kind: enum {
            Modulo, Bounded,
        },
    };

    const Binary = struct {
        bitwise_defined: bool,
        shift: enum {
            Arithmetic, Binary,
        },
    };
};
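For concreteness, here is how today's u8 might be described under Solution 2 (hypothetical: the .Bits(8) size notation is borrowed from the excerpt further up, and the field values are one reading of current u8 semantics):

const u8_description = Int{
    .size = .Bits(8),
    .edge = .Undefined, // overflow is safety-checked UB today
    .arithmetic = .{ .signedness = .Unsigned, .kind = .Bounded },
    .binary = .{ .bitwise_defined = true, .shift = .Binary },
    .token = null,
};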
Zig already has 20 different basic integer classes. It seems Zig inherited from C that an integer is characterised by signedness and size, and it adds a further attribute: length is determined by the C ABI, explicit, or pointer-sized. On these integers the main operators are:
+, -, *, / (can overflow)
%+, %-, %*, %/ (does not overflow)
&, |, ^, <<, >>
==, !=, <, <=, >, >=
!
%
lots of @functions, including arithmetic with explicit overflow reporting, e.g.
@addWithOverflow(comptime T: type, a: T, b: T, result: *T) bool
This is clearly inherited from C, with the exception of the modular operators %+, %-, %*, which are Zig specific.
I propose to separate (potentially overflowing) arithmetic from modular arithmetic and from bitwise operators of different sizes, and let each be defined on its own integer type:
i8/16/32/64/128 --> +, -, *, !, @divTrunc(), @divFloor(), @rem(), @mod(), ==, !=, <, <=, >, >= (signed comparison)
u8/16/32/64/128 --> +, -, *, /, !, %, ==, !=, <, <=, >, >= (unsigned comparison)
m8/16/32/64/128 --> %+, %-, %*, ==, !=, !
Update: % as an operation from say m32 -> m32 makes no sense for modular ints,
but conversions like m32 --> m17 are well defined and harmless and can be expressed with @as(m17, n).
Update: modular division is slightly subtle; see the open issues.
b8/16/32/64/128 --> &, |, ~, ^, <<, >>, @shra(), @rotl(), @rotr() (+ more bitops), ==, !=, !
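For illustration, here is hypothetical code under this proposal (openFile, Read and Write are invented names): a flags parameter typed b32 admits the bitwise operators but rejects arithmetic without an explicit cast.

const Read: b32 = 0x1;
const Write: b32 = 0x2;

fn openFile(path: []const u8, flags: b32) void {
    // ...
}

test "flags are bits, not numbers" {
    openFile("log.txt", Read | Write); // ok: | is defined on b32
    // openFile("log.txt", Read + 1); // compile error: + is not defined on b32
}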
Mathematically this is natural: it makes a distinction between the integers Z, which are approximated by i64 (in the same sense that the real numbers are approximated by f64); the integers modulo 2^64, i.e. Z/2^64Z, which can be exactly represented by the 64 bits in m64; and 64-bit bitvectors {0,1}^64.
Some open issues:
Naming:
m8/16/32/..., b8/16/32/... is very minimalistic; explicit modular8/16/32/... and bits8/16/... can also be argued for.
Notation:
Is there still a need for the notations %+, %-, %* when the type already makes explicit that modular arithmetic is asked for?
This is less explicit, but no different from the overloading of +, -, ... between the different integer types and floating point.
Intermediate types and (in)equality
Intermediate-length types work just fine. E.g. a type m3 would simply be represented by 8 or 32 bits with arithmetic operators. The %+, %-, %* (see below for %/) work just fine using the corresponding two's complement arithmetic operators (because that is modular arithmetic!), as long as (in)equality is properly defined, i.e. e.g. in m5
a == b

is equivalent to

(a - b) == 0

which is equivalent to

@intCast(b8, a - b) & 0b11111 == 0

which is e.g. equivalent to

@intCast(b32, a) & 0b11111 == @intCast(b32, b) & 0b11111
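The masking argument can be checked with today's types (a sketch: u8 and the wrapping -% stand in for the proposed m5 operations):

const std = @import("std");

test "equality mod 32 reduces to the low 5 bits" {
    const a: u8 = 3;
    const b: u8 = 35; // 35 == 3 + 32, so a == b (mod 32)
    std.debug.assert(((a -% b) & 0b11111) == 0);
    std.debug.assert((a & 0b11111) == (b & 0b11111));
}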
Conversion:
Conversions like i32 --> m32 and u32 --> m32 are a nop on the bit level and respect +, -, *, !=, ==, so they are harmless. Even conversions like m64 --> m32 or even m32 --> m5 are harmless, and can be a nop if one does lazy normalisation, only normalising when (in)equality needs to be computed; so conversions like i64 --> m5 are likewise harmless.
The converse conversion m32 --> i32 is sign extension, while m32 --> u32 is zero extension, so it may be non-trivial and in any case represents a choice, mainly about what one means by <. Thus it requires a cast.
I would suggest that i32 --> b32 and vice versa also both require a cast. This separates the bit ops from the arithmetic operations and gives extra type safety to things like the flags argument of a system call.
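A current-Zig analogue of these conversions (a sketch; since the m and b types do not exist yet, u5 and i32 stand in for the m5 and i32 targets):

const x: u32 = 0xFFFFFFF0;
const low: u5 = @truncate(u5, x); // like u32 --> m5: keep the low bits
const signed: i32 = @bitCast(i32, x); // like m32 --> i32: a choice, hence an explicit cast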
Semantics of %/ vs @ModDiv
Mathematically, the precise semantics of %/ when dividing by a power of 2 is iffy. It should overflow just like dividing by 0 overflows (i.e. is undefined behaviour). However, e.g. if x: m64 and x == 0 in m32 (i.e. x is a multiple of pow(2, 32)), then x %/ pow(2, 32) is well defined in m32. Hence it is probably better to have @ModDiv() (which uses the Euclidean algorithm and can overflow) and @modDivExactPow2().
Bitops
The b bitfield type has << and >> shifts; the latter is a logical right shift. The arithmetic shift is provided as @shra(); see the sketch below. (Arguably << and >> are better off, and less error prone, as @shl() and @shr() (or even more arguably @shiftup() and @shiftdown()), being just two of the more commonly used bit operations.)
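A sketch of what @shra() would compute, written with today's operators (this assumes Zig's >> on a signed operand sign-extends, which is what makes it the arithmetic shift):

fn shra(x: u32, n: u5) u32 {
    // Reinterpret as signed so that >> shifts copies of the sign bit in,
    // then reinterpret the result back as raw bits.
    return @bitCast(u32, @bitCast(i32, x) >> n);
}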
Powers and xor.
The above makes mixing arithmetic and bitwise operators unpleasant, because you have to cast. The compiler should have no trouble changing a: u32; _ = a % pow(2, 12) (or a: u32; _ = @as(m12, a)) to bit operations under the hood; see the sketch below. (However, IMO, the pow(2, 12), while not terrible, is less than optimal. The obvious solution is to use ^ as the power operation on signed, unsigned and modular integers, and use @xor() (or #, or an infix @xor@ or :xor:, or just xor) for the xor operation. Zig already uses 'and' and 'or' for logical and and or, and they are a lot more common than xor.)
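The optimisation spelled out, for the unsigned case:

fn low12(a: u32) u32 {
    return a % 4096; // for unsigned a this is exactly a & 0xFFF, and compilers lower it to the mask
}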
Comparison ops @ltu(), @leu(), @lts(), @les() for modular ints:
The Boolean function fn ltu(n: m32, m: m32) bool { return @intCast(u32, n) < @intCast(u32, m); } is of course perfectly well defined, but it has no good properties like a < b implies a + c < b + c, and it should not sneakily pretend to have them; so intrinsics for unsigned and signed comparison are probably a good idea.
Rogier Brussee