Proposal: Represent integer endianness in the type system #18796
Comments
Can you elaborate on a use-case for this? I think it's a best practice to only handle endianness at I/O boundaries (e.g. file or network streams), and store all integers in memory as native endian.
@jayschwa Repeated from the post above, a use case would be describing a binary format (for example a file header or network protocol) using a `packed struct`.
So you'd like to be able to:

```zig
const Header = packed struct {
    foo: u32le,
    bar: u32le,
};

const header = try reader.readStruct(Header);
```

versus what is done now:

```zig
const Header = struct {
    foo: u32,
    bar: u32,
};

const header: Header = .{
    .foo = try reader.readInt(u32, .little),
    .bar = try reader.readInt(u32, .little),
};
```

Is that correct?
@jayschwa Sure, that's one usage example that would be improved. Serializing/writing out a value of the type, and accessing the fields individually, would similarly be less prone to bugs.
Another use-case would be UTF-16. From #649 (comment) (a pointer endianness proposal):
In this proposal's case, that'd be
Endian as an attribute alongside
As someone who's now written a handful of parsers and file format handlers in Zig, I agree with jay's original comment: it makes the most sense, imo, to only deal with endianness at the I/O boundary and keep in-memory integers native-endian.
Re binary format parsing: the field order of packed structs depends on platform endianness too, so for that to actually work, you'd need a flag to override that as well.
@RossComputerGuy I'm not sure I understand your point - after #11834.

@nektro Regarding the repeated "only at I/O boundaries" comment - the purpose of this proposal is exactly to make it easier to express endianness at these boundaries.

@zzyxyzz Here is a (longer) code example of how endian-aware backing (naturally) works:

```zig
// bits in comments numbered starting with bit 0 (so the second byte is range [8, 15])
pub fn Example(comptime BackingType: type) type {
    return packed struct(BackingType) {
        a: u3, // stored in lowest-value bit range [0, 2]
        b: u15le, // stored in bit range [3, 17]
        c: u10be, // stored in bit range [18, 27]
        d: i4, // stored in highest-value bit range [28, 31]
    };
}
const ExampleLE = Example(u32le);
const ExampleBE = Example(u32be); // note: exact same definition, but with a different backing type

// example values (@endianCast here inserted assuming we're starting from little-endian)
const example_le = ExampleLE{
    .a = 4,
    .b = 0xABC,
    .c = @endianCast(0x1EF),
    .d = -1,
};
const example_be = ExampleBE{
    .a = example_le.a,
    .b = example_le.b,
    .c = example_le.c, // type system sees they're both BE, otherwise @endianCast would be required
    .d = example_le.d,
};

// workaround for status-quo
// (note: this means the value of example_be is actually laid out incorrectly in memory)
//const u10be = u10;
//const u15le = u15;
//const u32le = u32;
//const u32be = u32;

const std = @import("std");
const expectEqual = std.testing.expectEqual;

// note that endianness is part of the backing type
// (intFromPacked/packedFromInt as builtins or in std.meta would make usage even simpler)
const BEBacking = @typeInfo(ExampleBE).Struct.backing_integer.?;
const LEBacking = @typeInfo(ExampleLE).Struct.backing_integer.?;

test "read individual bytes" {
    // both values hold the same value in their 1st- to 4th-valued byte,
    // those bytes are just at different addresses.
    const ptr_le: *const [4]u8 = @ptrCast(&example_le);
    const correct_be: BEBacking = @bitCast(example_be); // enabled by this proposal
    // workaround for status-quo: without the proposal, example_be isn't laid out in memory correctly
    //const correct_be: BEBacking = @endianCast(@as(BEBacking, @bitCast(example_be)));
    const ptr_be: *const [4]u8 = @ptrCast(&correct_be);
    try expectEqual(ptr_le[0], ptr_be[3]);
    try expectEqual(ptr_le[1], ptr_be[2]);
    try expectEqual(ptr_le[2], ptr_be[1]);
    try expectEqual(ptr_le[3], ptr_be[0]);
}

test "read full backing" {
    const be_backing: BEBacking = @bitCast(example_be);
    const le_backing: LEBacking = @endianCast(@as(BEBacking, be_backing)); // enabled by this proposal
    // workaround for status-quo: without the proposal, example_be isn't laid out in memory correctly
    //const le_backing: LEBacking = @endianCast(@endianCast(@as(BEBacking, @bitCast(be_backing))));
    const another_le: ExampleLE = @bitCast(le_backing);
    try expectEqual(example_le, another_le);
}

/// workaround: replace all @endianCast-s with this function for running under status-quo.
/// Note that this function ALWAYS SWAPS,
/// whereas the proposed @endianCast would be smart enough to know when to swap
/// (it would either no-op or lead to a type error if the endianness doesn't change).
fn flexibleSizeByteSwap(x: anytype) @TypeOf(x) {
    if (@TypeOf(x) == comptime_int) return flexibleSizeByteSwap(@as(std.math.IntFittingRange(x, x), x));
    const byte_aligned_size_bits = ((@typeInfo(@TypeOf(x)).Int.bits + 7) / 8) * 8;
    const ByteAlignedSizedInt = std.meta.Int(.unsigned, byte_aligned_size_bits);
    const swap_buffer: ByteAlignedSizedInt = x;
    return @truncate(@byteSwap(swap_buffer));
}
```

This works under status-quo once you activate / swap in the code next to the 4 "workaround" comments. Without this proposal there's no way for the compiler to tell you where you should or should not insert swaps (because we don't tell it via the type system).
@rohlem

The source of this information is in the specification for the format/protocol, and it is stored at the module level in Zig, not the type system. E.g. a format/protocol isn't gonna have mixed endianness throughout the duration of those reads, and it's totally unnecessary for it to be exposed to the user; e.g. https://git.sr.ht/~nektro/magnolia-desktop/tree/1ddabe69/item/src/e/Image/qoi.zig#L142-144. This file is called through

One could imagine me adding a
Commenting that this would be extremely helpful for my use case, as originally proposed by @rohlem.
Agree with @mochalins that this feature would be great for firmware and kernel devs (and likely emulator authors as well). I wanted to add that GCC added a type attribute for this called `scalar_storage_order` [1].

[1] https://gcc.gnu.org/onlinedocs/gcc-8.5.0/gcc/Common-Type-Attributes.html
Regardless of whether it's done implicitly or explicitly (I'd much prefer explicit), this design leads to

If your intent is to specify how to encode and decode your struct type, I think field tags are a better solution. Chances are you're going to have to handle struct alignment and other metadata there anyways.

```zig
const Header = packed struct {
    foo: u32,
    bar: u32,

    /// Use `@hasDecl` to discover this in your reader/writer impl and call `@byteSwap` appropriately.
    pub const endianness = .{
        .foo = .big,
        .bar = .little,
    };
};
```

It's also observed in
@clickingbuttons if I understand your comment correctly, you're suggesting that identifying the endianness of field values via types would introduce more conversions to their values.
Introducing a separate mechanism means that every call site has to respect this additional mechanism, otherwise it silently misuses the values (leading to endianness bugs).
Right, the last point isn't about byte swaps but bit usage within those bytes. If you care about these padding bit locations, even in status-quo you already have to round up to the next-larger integer type.
This is where you are mistaken. Every load and store MUST potentially byte swap:

```zig
const a = try std.fmt.parseInt(i32be, "131072", 10); // on store, this value MUST be byte swapped or it'd be an `i32` instead of `i32be`
std.debug.print("{d}\n", .{a}); // on load, this value MUST be byte swapped or it will print the wrong value
```
@clickingbuttons

Your "on store" example could either be a compile error pointing out the endianness mismatch, or the implementation of

In status-quo, you can already remember all the necessary places to
I understand you want a

My arguments are:
You're free to use them only at these boundaries. I don't anticipate using them very often either.
By now I hope you understand my point that only uses which need to perform conversion would perform it, so I'll stop reiterating it. For a use case where this would make sense, let's consider a program that listens to a lot of mixed-endian data, and then only has to use a handful of data points from it. The reason for a language proposal, then, is that I think it's simple enough to be useful without adding too much complexity to the compiler.
The main advantage I see is:
I don't see it as critical to the fundamental proposal, and as a user I don't strongly care. If you want my opinion though: intrinsics should error and tell the user to

We could also add a flag
What I don't quite like about this proposal is that it attaches endianness to integer types directly, even though it's not really a property of the integers themselves, but only of their storage layout, and as such only makes sense within packed structs. Everywhere else, this type annotation would either do nothing (in the best case) or introduce unnecessary byte swaps where the compiler is unable to elide them (which should be pretty much everywhere except in-register operations). In my opinion, endianness should be more like an alignment annotation:

```zig
const Header = packed struct {
    a: u16 endian(.big),
    b: u16 endian(.big),
};
```

Then the compiler could automatically byteswap if necessary on reads and writes. Whether this is actually desirable is another question, though. On one hand, this is clearer and safer. On the other, it will probably cost extra byteswaps in most cases. Lazy conversions are only more efficient if you expect to touch a few fields; otherwise it's probably better to do the entire conversion at the I/O boundary, as others have pointed out.
@zzyxyzz By keeping the information as part of the type, it is more explicit, and unnecessary conversions are easier to spot/avoid. That is why I would prefer it, and proposed it this way. It's also more flexible, in that every usage site can decide whether to use

Say you want to cache an ID that is conceptually a foreign-endian number, which you intend to send/write multiple times, more often than using it as native-endian in local logic.

I agree that the most common use case is for specifying interfaces using

In my eyes (and obviously I'm biased) my proposal is at its core already very simple. Maybe by discussing we'll still find something even simpler, or come across some complexity / drawback that I've overlooked.
@rohlem
@zzyxyzz I think that information is detailed in the OP, but to clarify: as stated several times now (including in the OP), "automatic byte-swaps" are a non-goal. If maintainers decide that we should allow foreign-endian arithmetic

Though keep in mind that the toolchain can always optimize everything under the as-if rule.
@rohlem

```zig
// Native endianness: Little
const S = packed struct {
    lil: u16 = 0,
    big: u16be = 0,
};

fn foo() void {
    const a: u16be = 1;
    const b: u16be = 2;
    const c: u16 = 3;
    var s: S = .{};

    const p = a + b; // error
    const q = a + c; // error
    const r = a + 1; // error
    s.lil = a; // error
    s.big = a; // no error, simple store
    s.big = 1; // no error?
}
```

Basically, everything is an error except storing something in a location with the same type and endianness.

Note: edited
@zzyxyzz Yes, that would be my personal preference.
Then you would be unable to initialize fields and variables without an `@endianCast`.

Another option to consider is not aliasing the normal integers with the native endianness variant. (I.e., making

BTW, I'm not yet in favor of this proposal, just thinking out loud :)
@zzyxyzz I don't see an issue with requiring an

Arguments I can think of against doing this:

Note that this is already the same thing that happens when f.e. using platform-specific APIs: other build configurations' code paths aren't analyzed, so errors aren't spotted until you build for that configuration.

The universally-correct approach should be to use
Would it make sense to have big/little endianness denoted by the capitalization?

U8/I8: Big Endian
@expikr Then how would you denote native endianness?
As an example, the Linux kernel uses aliases like `__le32` and `__be32`.
Duplicate of #3380 |
(EDIT: Finished with initial editing: added comments on foreign-endian arithmetic and non-byte-aligned `@bitCast`-s that I thought of too late.)
(EDIT2: Finished (I think) restoring the post (EDIT3: again) after GitHub gave me a stale edit buffer, swallowing over half of my edits. q.q)
The type system is a useful tool for communicating, between programmers as well as to the compiler, the intent of what is meant by data.
This can be important within a single code base, but is even more important at interfaces between modules.
Clearly denoting where we expect multi-byte integers to be in little- or big-endian format (regardless of the host's endianness) would help discover mismatches.
The basic, minimal version:

- Add a `std.builtin.Endian` field to `std.builtin.Type.Int` and `std.builtin.Type.ComptimeInt`.
- Integer types that differ in endianness are distinct, and values do not freely convert between them; a `@bitCast` is required.

This would already give interfaces (say `packed struct`-s representing binary formats) the required expressiveness.
For minimizing the implementation effort and language change, we can forbid all arithmetic operations on foreign-endian integer types (at least initially).
Further ideas (optional):

- Type names: add `be` and `le` after the current `(u/i)N` names (so f.e. `u16be`, `i32le`). The current type names still exist and alias the host-native variant (with endianness `@import("builtin").target.cpu.arch.endian()`).
- `@endianCast` for integer arguments, which performs a `@byteSwap` on the input to output the inverse-endian type. We could also support the input and output type being of the same endianness and make it a no-op in that case. I guess this could be useful to shorten generic code.
- Allow `@intCast` and `@truncate` to work on foreign-endian integer types (the output's endianness staying the same as the input's). Seems like they would be useful, and not too difficult to implement.
- Allowing arithmetic on foreign-endian integers, but always yielding a native-endian result, would unburden users from frequent `@endianCast`-s back-and-forth, essentially inserting them behind-the-scenes. Since there's an ever-so-slight performance penalty to the required `@byteSwap` though, I think we can keep burdening users with explicitly writing out the `@endianCast`.
- Note that `usize != std.meta.Int(.unsigned, @bitSizeOf(usize))` in status-quo, which imo is suboptimal because it may lead to needless function instantiations. (But if that's not an obvious drive-by improvement I can open a separate proposal for separate discussion.)
EDIT4: Probably not an issue, but what to do about non-byte-aligned-sized integers? The occupied bits in their representation differ:

- On little-endian, a `u9` occupies only one bit in the high-address byte. (bit numbering: `12345678`, `XXXXXXX9`)
- On big-endian, a `u9` occupies only one bit in the low-address byte. (bit numbering: `XXXXXXX9`, `12345678`)

Originally I thought that `@bitCast`-s would have to specifically handle this case, however I am no longer convinced this is an issue, because the value's byte-representation is only ever observed in external interfaces (`callconv(.C)` functions, `extern struct`/`extern union`), which currently only allow byte-aligned types, and when accessing memory via pointers, where we already insert bit-padding to fill out `@sizeOf(T)` bytes.

(EDIT5: I previously mixed up bit-shifting and byte-swapping at this point and freaked myself out about pointers. But because the bit-order within a byte isn't observably different between endiannesses, there are no bit-shifts necessary and I see no more issue, even with pointers.)