# Proposal: `ubyte`/`ibyte` for smallest addressable unit of memory (#7693)
## Comments
C has `CHAR_BIT`, so perhaps something like that? Perhaps a `const`.
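For illustration, a minimal sketch of what such a `CHAR_BIT`-style constant could look like in Zig; the name `byte_bits` is invented here, and since the proposed `ubyte` does not exist yet, `u8` stands in for it:

```zig
/// Hypothetical analogue of C's CHAR_BIT: the number of bits in the
/// smallest addressable unit. Under this proposal it would be
/// @bitSizeOf(ubyte); on today's targets it is 8.
pub const byte_bits = @bitSizeOf(u8);
```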
I want to say this won't work, since `memcpy`/`memset` couldn't then be implemented in a portable way.
It's fine, since the language spec doesn't need to define the implementation of `memcpy`/`memset`, just the header. The fact that you can't implement it in a portable way is unfortunate, but not required from the language itself; the std lib could fill that gap.
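As an illustration of how the std lib might fill that gap, here is a minimal sketch of a set routine written against a comptime byte type; `Byte` stands in for the proposed `ubyte`, and `setBytes` is an invented name:

```zig
/// Sets every element of `dest` to `value`. Because the byte type is a
/// parameter, the same code works whether the smallest addressable
/// unit is u8 or something wider.
fn setBytes(comptime Byte: type, dest: []Byte, value: Byte) void {
    for (dest) |*b| b.* = value;
}
```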
In Go, I do like having a distinction between `byte` and `uint8`.

One thing I don't understand about this proposal is why there should be two variants for signedness. Isn't having an opinion on the signedness of the value an indication that it should be an integer type?
It is an integer type. That's how we represent memory in Zig. We don't have a bit vector type, and I'm not proposing one.
We actually may need to change that; it causes undefined behaviour in hidden and unexpected places, e.g. …
Ok, sure. But that's an orthogonal concept.
I don't know about that. What if we have a …?
We could do the bag-of-bits thing at any size: define …
@EleanorNB, do you propose that …?
That is the idea, yes. Also, I'm not entirely clear on the difference between a platform that requires all memory accesses to be aligned to at least some size, and a platform whose byte is that size. They are the same thing, are they not?
I think a `byte` type along these lines could work. If a project assumes that bytes are 8 bits, it could assert this at compile time:

```zig
comptime {
    if (@bitSizeOf(byte) != 8)
        @compileError("This project requires 8-bit bytes!");
}
```
They are. I was just subtly hinting that no additional language features should be necessary to support hardware that has multiple-of-8-bit addressable units 😄. It would probably be a matter of setting the proper alignment in the allocator interface and taking a minor performance penalty when working with `u8`.

The more general case of arbitrarily-sized bytes is a different matter, though. A native …

In other words, either you don't care about performance very much, in which case you simply let the compiler generate the appropriate mask/unmask operations all over the place and get portability for free; or you do care about performance, in which case you need to write your program specifically using `ubyte`. It might be possible to write width-agnostic code (using only …
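For concreteness, a minimal sketch of the kind of mask/unmask code the compiler would have to generate, assuming a hypothetical target whose smallest addressable unit is 16 bits (so `u16` stands in for `ubyte`; `loadU8` is an invented name):

```zig
/// Reads the index-th 8-bit value out of 16-bit-addressable memory by
/// loading the containing word, then shifting and truncating.
fn loadU8(mem: [*]const u16, index: usize) u8 {
    const word = mem[index / 2];
    const shift: u4 = @intCast((index % 2) * 8);
    return @truncate(word >> shift);
}
```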
I like the symmetry of the "bag of bits" data type with the `i*` and `u*` types. I.e. `i1`, `u1`, `bits1`; `i2`, `u2`, `bits2`; ... `i16`, `u16`, `bits16`; ...

Going off on a tangent... It might be nice to extend that symmetry to other bag-o-bits types. The generic bag of bits above could be a base type to which all the others coerce automatically. The bag of bits type(s), …
Addresses might be another one, as a base type for pointers. All pointers could coerce automatically down to that type. The base unit would be @EleanorNB's address unit type in this proposal. As above, to coerce to a typed pointer, you would need to use …

And I'll stop there with the tangent.
@kyle-github, your idea may be related to #7512, which proposes to separate signed, unsigned and modular integers from bitstrings. Sure, doing things like that has a certain elegance from the point of view of symmetry and separation of concerns. But what do we really gain from this, other than additional …?
EDIT: PEBKAC problem using …

@zzyxyzz, thanks for the pointer to the other issue. It seems vaguely related insofar as it proposes new numerical behavior, but that is not what I am thinking about here. This is sort of a half-baked idea, so hopefully the stuff below makes sense!

A key point is that raw bits should have relatively few allowed operations: mostly masking and bit-level operations like and, or, xor. Shifting makes some sense, as does rotation. Arithmetic operations on raw bits make no sense; there is no interpretation of meaning beyond 1 and 0 and position for each bit. As to casts, I think there would be relatively few.

The advantages I see are:

1. The existing integer types get exactly defined results on all platforms.
2. You get a clean way to drop down to the raw platform representation when you need it.
The first point is pretty much what we have now, but with guarantees about the results of operations across all platforms. That said, the exact operations and guarantees can be carefully chosen to make sure no one is surprised and implementation on common platforms is not overly painful (Java is both a good and bad example here). If I use an `i8`, I should get exactly the same behavior across all platforms regardless of what the CPU wants to do. It might be costly, but I should get identical results on all platforms.

The second point has two advantages. The first is that you have a way to cleanly drop to the platform representation via automatic down coercion. If you need access to the raw bits without Zig applying an interpretation or restrictions on them, it is trivial to do. That should reduce …

Quick example: Suppose I am building a VM and I am going to use NaN boxing. In current Zig you would need to … Once you pull out the payload, you will need to either … If you drop it down (without needing a cast) into a raw bit type, then you can mask it and extract the parts you need, and then do a single, final cast.

Sorry, I meant to keep this short! I need AA for long-posters :-(
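For reference, a minimal sketch of the bitcast-mask-cast chain being described, in current Zig; the constants and names (`QNAN`, `PAYLOAD_MASK`, `boxPayload`, `unboxPayload`, a 48-bit payload) are illustrative assumptions, not part of the original comment:

```zig
// Quiet-NaN bit pattern and a 48-bit payload mask for f64.
const QNAN: u64 = 0x7ff8_0000_0000_0000;
const PAYLOAD_MASK: u64 = 0x0000_ffff_ffff_ffff;

fn boxPayload(payload: u48) f64 {
    return @bitCast(QNAN | @as(u64, payload));
}

fn unboxPayload(value: f64) u48 {
    const bits: u64 = @bitCast(value);
    // The mask and the final truncating cast are the steps that are
    // easy to forget when everything is passed around as a u64.
    return @truncate(bits & PAYLOAD_MASK);
}
```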
I must admit I still don't get what the advantage of a raw memory type would be.
Do they? I was under the impression that IEEE-754 defines a precise bit encoding. Some architectures are not fully IEEE-754 compliant in other areas, such as rounding modes or handling of subnormals, but I'm not aware of architectures that change the encoding. (I could be wrong about this. Are there any?) Byte order should also not be relevant, since it applies to memory layout and not to in-register values. If you …
If you do bit-twiddling on float values, you are indeed on your own. But once you decide to do it, working with bit strings isn't really much safer than working with unsigned ints. Besides, most bit-twiddling operations require both logical and integer operations. For example, …
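As one sketch of such mixing (`exponentOf` is an invented name): extracting the unbiased exponent of an `f64` takes a shift and a truncation on the logical side, plus an integer subtraction to remove the bias:

```zig
fn exponentOf(x: f64) i32 {
    const bits: u64 = @bitCast(x);
    const raw: u11 = @truncate(bits >> 52); // logical: shift + truncate
    return @as(i32, raw) - 1023; // arithmetic: remove the exponent bias
}
```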
What do you mean? Once you extract the payload and cast it to the appropriate type, all's good, no?
You would think so, but no. See endianness and floating point on Wikipedia: some processors use little-endian ints and big-endian floats, or vice versa. YMMV. I have hit this before and it sucks. I think I hit it on MIPS.
I disagree on the safety, but your points about common bit-twiddling are important ones. That said, most of this kind of bit twiddling should be intrinsics or compiler built-ins, as a lot of CPUs now have single instructions for them and people should not be doing it themselves. If I saw this in new code today I would consider it a code smell. For instance, a lot of masking is due to a lack of easy bit-field extraction operations. I have seen cases where type punning into bit-field structs in order to extract fields ends up producing better code.

I do some low-level protocol programming right now (not my day job) and I am so trained to do it as you note that I do not even notice when the compiler supports a better/higher-level way to do the same thing. Part of the idea here is to make people like me who have done this too much think again :-)
You left off the final sentence of that paragraph: …

Zig does not help you remember to do that last cast to `u48`. It is that last part that I am trying for: make it both the easiest path to do the right thing and make it harder to do the wrong thing. I see a lot of code (to be clear, in other languages) that isn't doing that final cast.

Let me make it clear: I am not sure that there is either sufficient benefit to this to justify it, or that there is any interest, so I really appreciate the time you are taking to think about this and respond. Your point about bit-twiddling brings up some good cases.
The key issue I ran into was that if any bits are undefined in an integer, the whole integer is undefined.
Interesting, @daurnimator! The undefined bits would come from things like padding, or unused bits in an int that was not a machine word size? I wonder if the behavior of undefined should continue to be like SQL's NULL. For instance, OR of a 1 against any value is 1, so should OR of 1 against an undefined bit still be undefined? Similarly, AND with 0 will always return 0. Maybe that is too complicated...
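A quick sketch of that NULL-like idea (the `Bit` enum and function names are invented for illustration): OR with a known 1 and AND with a known 0 could yield defined results even when the other bit is undefined:

```zig
const Bit = enum { zero, one, undef };

fn bitOr(a: Bit, b: Bit) Bit {
    if (a == .one or b == .one) return .one; // 1 dominates, even vs. undef
    if (a == .undef or b == .undef) return .undef;
    return .zero;
}

fn bitAnd(a: Bit, b: Bit) Bit {
    if (a == .zero or b == .zero) return .zero; // 0 dominates, even vs. undef
    if (a == .undef or b == .undef) return .undef;
    return .one;
}
```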
Hey everyone, the bag-of-bits idea is *completely* irrelevant to this proposal. Stop discussing it here. Go to #8388, which is specifically about that.
@EleanorNB I would assume the main blocker for this is that it would require the non-LLVM backends, as LLVM does not support a non-8-bit `char`. It would also be useful to estimate how this would affect the complexity and performance of the compiler.
@matu3ba Additional reading: …
## Original issue description

Currently, it is assumed that all hardware uses `u8` as the address unit, which is not universally true. However, this assumption is built into the language, or into things fairly central to the language's use, in several places, such as `@memcpy`/`@memset` and the allocator interface. The language has a concept of address units in `align`/`@sizeOf`, but provides no direct interface to it.

I propose the simplest solution: `ubyte` and `ibyte` types, representing the smallest addressable memory size and scaling with the platform just like `usize` does. Anywhere in the language where "raw memory" is required would then use `[*]ubyte` rather than `[*]u8`, and be portable universally.

Alternative names: `ucell`/`icell`, `unit`/`init` (but pls no).
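A hedged sketch of what raw-memory interfaces could look like under this proposal; `ubyte` does not exist in Zig today, so it is declared here as a stand-in alias, and `rawSet`/`rawCopy` are invented names:

```zig
// Stand-in: on current targets the smallest addressable unit is 8 bits.
// Under the proposal, `ubyte` would be provided by the language and
// scale with the platform, the way `usize` does.
const ubyte = u8;

/// Raw-memory set and copy stated in terms of the smallest addressable
/// unit instead of a hard-coded u8.
pub fn rawSet(dest: [*]ubyte, value: ubyte, len: usize) void {
    for (0..len) |i| dest[i] = value;
}

pub fn rawCopy(dest: [*]ubyte, src: [*]const ubyte, len: usize) void {
    for (0..len) |i| dest[i] = src[i];
}
```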