Proposal: usize definition should be refined #5185

Open
ikskuh opened this issue Apr 27, 2020 · 23 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
@ikskuh
Contributor

ikskuh commented Apr 27, 2020

I came across the definition of usize, which is currently defined as "unsigned pointer-sized integer", and a question arose: the size of which pointer? Function pointer? Pointer to constant data? Pointer to mutable data?

For most platforms, the answer is simple: There is only one address space.

But as Zig tries to target all platforms, we should bear in mind that this is not true for all platforms.

Case Study:
Zig supports AVR at the moment which has two memory spaces:

  • Data
  • Code

Both memory spaces have different addressing modes which can be used with the Z register, which is a 16 bit register. Thus, we could conclude that the pointer size is 16 bit. But the AVR instruction set also has a RAMPZ register that is prepended to the Z register to extend the memory space to 24 bit. Some modern AVRs have more than 128k ROM (e.g. Mega2560). This means that the effective pointer size is 24 bit.
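
A minimal sketch of the RAMPZ:Z combination described above, assuming RAMPZ supplies the upper 8 bits and Z the lower 16. The helper name `farAddress` is hypothetical and is not part of Zig or of any AVR support package.

```zig
const std = @import("std");

// Hypothetical helper: prepend RAMPZ (upper 8 bits) to the 16-bit Z register
// to form the 24-bit address used by extended program-memory loads (ELPM).
fn farAddress(rampz: u8, z: u16) u24 {
    return (@as(u24, rampz) << 16) | @as(u24, z);
}

test "RAMPZ:Z reaches addresses a 16-bit pointer cannot hold" {
    const addr = farAddress(0x01, 0x2345);
    try std.testing.expect(addr == 0x012345);
    try std.testing.expect(addr > std.math.maxInt(u16));
}
```

Any address with a non-zero RAMPZ part lies beyond what a 16-bit integer can represent, which is the gap the proposal is about.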

The same problem arises when targeting the 8086 CPU with segmentation. The actual pointer is a 20 bit value that is calculated by combining two 16 bit values (segment + offset).

Problem:
usize communicates that it stores the size of something, not the address. Right now, usize can contain values larger than the biggest (contiguously) addressable object in the language and it takes up more space than needed.

C has two distinct types for that reason:

  • size_t (can store the size of an addressable object)
  • uintptr_t (can store any pointer)

AVR-GCC solves the problem of 24 bit pointers by ignoring it and creates shims for functions that are linked beyond the 128k boundary. Data beyond the 64k boundary cannot be addressed and afaik LLVM has the same restriction. I don't think Zig should ignore such platform specifics and should be able to represent them correctly.

Proposal:
Redefine usize to mean "can store the size of any object or array" and introduce a new type upointer that is a pointer-sized integer. Same for isize and ipointer.

It should also be discussed if a upointer will have a guaranteed unique representation or may be ambiguous ("storing a linear address or storing segment + descriptor")?

Changes that should be made as well:

  • @ptrToInt should now return, and @intToPtr should now accept, upointer instead of usize
  • @sizeOf will still return usize

Pro:

  • Communicates intent more precisely by using distinct types for int-encoded pointers and object sizes / indices
  • Saves memory as object sizes may be 50% smaller than pointers

Con:

  • One more type
  • May spark confusion for people who assume that pointer size is always object size

Example:

// AVR:
const usize = u16;
const upointer = u24;

// 8086:
const usize = u16;
const upointer = u32;
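
To put a number on the "50% smaller" claim, here is a small sketch using the 8086 widths from the example above. The aliases `Size8086` and `Pointer8086` are hypothetical stand-ins for the proposed usize and upointer. They are not existing Zig types.

```zig
const std = @import("std");

// Hypothetical aliases mirroring the 8086 example; neither name exists in Zig.
const Size8086 = u16; // large enough for any single object
const Pointer8086 = u32; // large enough for segment + offset

// A table of 64 object lengths, declared once per candidate element type.
const LengthsAsSize = [64]Size8086;
const LengthsAsPointer = [64]Pointer8086;

test "storing lengths in the size type halves the table" {
    try std.testing.expect(@sizeOf(LengthsAsSize) == 128);
    try std.testing.expect(@sizeOf(LengthsAsPointer) == 256);
}
```

The saving only applies to values that really are sizes or indices. Anything that must round-trip through a pointer still needs the wider type.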

Note:
I'm not quite sure about all of this yet, as this is a very special case that only affects some platforms, whereas most platforms don't have the "object size is not pointer size" restriction.

Resources:

Edit: Included answer to the question of @LemonBoy, added pro/con discussion, added example

@LemonBoy
Contributor

Counter proposal:

> Size of what pointer? Function pointer? Pointer to constant data? Pointer to mutable data?

The maximum across all the address spaces. This way we can also keep the ptr-to-usize (and usize-to-ptr) relationship (with the help of the addresspace pointer metadata).

@ikskuh
Contributor Author

ikskuh commented Apr 27, 2020

> The maximum across all the address spaces. This way we can also keep the ptr-to-usize (and usize-to-ptr) relationship (with the help of the addresspace pointer metadata).

I thought about that, and it has one problem: It will waste a lot of space. @ptrToInt and @intToPtr should use upointer, but if i want to store a size of something (which is the standard case), i should use usize.

This has two advantages: upointer should always store pointers, usize should always store object size. So @sizeOf will return usize.

Otherwise i would waste 50% of my memory with zero padded bytes by storing pointers where i could've used a type only having half of the size.

@ikskuh
Contributor Author

ikskuh commented Apr 27, 2020

Added some changes and updates to the original proposal

@JesseRMeyer

Since Zig supports arbitrarily sized integer types, each OS could define the bit length of their virtual memory system. This would reduce the well known 'pointer bloat' in the executable.

@ityonemo
Contributor

would you have to change the default typing rules on certain arithmetic events? like would the following operations make sense?

usize + usize OK
upointer + upointer NO
upointer + usize OK
usize * iX OK
usize * usize NO
usize * upointer NO
upointer * iX NO

@ikskuh
Contributor Author

ikskuh commented Apr 27, 2020

> would you have to change the default typing rules on certain arithmetic events? like would the following operations make sense?

No. I don't think this is something that is such a huge error source that it would be a benefit more than a hassle. Subtracting (and thus adding) values of upointer is quite helpful sometimes. Also note that upointer - upointer may still not fit into a usize.
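
A short sketch of that last point, assuming the AVR widths from the original example. `USize` and `UPointer` are hypothetical aliases for the proposed types, not real Zig types.

```zig
const std = @import("std");

// Hypothetical AVR-style aliases from the proposal; not real Zig types.
const USize = u16;
const UPointer = u24;

test "distance between two 24-bit addresses can exceed a 16-bit size" {
    const low: UPointer = 0x000100;
    const high: UPointer = 0x020000;
    const distance = high - low; // still a 24-bit value
    try std.testing.expect(distance == 0x01FF00);
    try std.testing.expect(distance > std.math.maxInt(USize));
}
```

The two addresses here are more than 64 KiB apart, so they cannot belong to the same object, yet the subtraction is still meaningful as address arithmetic.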

@Vexu Vexu added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Apr 28, 2020
@Vexu Vexu added this to the 0.7.0 milestone Apr 28, 2020
@ghost

ghost commented Apr 30, 2020

Is there any defense for isize and ipointer? The only time I can imagine a negative value in those contexts is to check for overflow, or something esoteric. The former can be done with existing operators. The latter could be done with an explicitly safe cast.

The purpose we're trying to imply is "size of a pointer" and "size of a datablock", so I would simply lean toward psize and msize storing "pointer sized unsigned integer" and "maximum-allocatable-memory sized unsigned integer".

This would keep the number of types the same, while increasing functionality and clarity.
isize, usize -> psize, msize

@kyle-github

There are OSes that treat pointers as signed. Solaris maybe? That's not a big market :-)

@ghost

ghost commented Apr 30, 2020

Real Life Use Case? 🤔

A pointer being treated as signed or not will not affect pointer arithmetic. The only possible situation for signed pointers is some sort of special bit value, but aside from null, what would it be? Zig doesn't even support non-0 null.

In OS's with signed pointers, we can just explicitly cast to the appropriate signed type to shuffle the psize data across language boundaries. I suspect the cast could even be implicit.

@ghost

ghost commented Apr 30, 2020

Disclaimer: I don't think the following is a well thought out idea as it stands, but my brain's completely pulling out the stops and I would feel remiss to not write it.

What if we dropped the idea of an intrinsic platform-specific pointer size and memory size altogether, and allowed platforms to define their memory spaces?

Something like this.

const avr = @import("avr");

const avr_code_pointer_size = @PointerSize(avr.code_memoryspace);
const avr_code_memory_size = @MemorySize(avr.code_memoryspace);
const avr_data_pointer_size = @PointerSize(avr.data_memoryspace);
const avr_data_memory_size = @MemorySize(avr.data_memoryspace);

User-defined names are obviously a no-go for cross-platform code, but any code that uses multiple memoryspaces wouldn't be cross-platform anyway, and at worst, converting a snippet to a single memoryspace architecture would be a find-replace.

Feels a bit like vkDevice. Zig intends to be very specific about allocations. This takes it one level deeper.

@ikskuh
Contributor Author

ikskuh commented Apr 30, 2020

> Real Life Use Case? 🤔

There is sometimes the need to do pointer/type erasure. Primary use case would be anything OS-relevant (as in you're coding an OS). That's where you work a lot with pointers-as-numbers instead of actual memory slices.

> Is there any defense for isize and ipointer?

Yes! Memory/pointer distances. You cannot express an object size delta with usize and you cannot express a pointer distance with upointer. You cannot store "this object is 15 bytes smaller and 240 bytes before X" with only unsigned types.

I like the idea of comptime inspectable memory spaces, but upointer should be able to store all pointers and usize should be able to store all sizes.

@ghost

ghost commented May 1, 2020

If one memory space could allocate 2^16-1 bytes and the other 2^24-1, using a 24 bit value for both could waste memory more often than not, depending on which address space is used most often?

@andrewrk andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 27, 2020
@ghost

ghost commented Nov 25, 2020

(Having heard the rationale for usize, I have decided no.)

I agree with this, but I have an issue with the naming: it's not clear at a glance whether the "size" in usize/isize refers to object size or machine word size. I propose, for object sizes, we use ulen/ilen, for symmetry with slice .len and .ptr, and for ALU words we use ualu/ialu; and we drop usize/isize entirely, so as not to mislead the programmer. I also propose we add udata/idata, for datapath-sized ints (in case an arch can't load a full register in one access, or it can load multiple), and for architectures capable of sub-byte or restricted to super-byte addressing, ucell/icell, the smallest addressable memory unit (so an allocator would return [n]ucell, and @sizeOf(T) returns the size of T in ucells). So, the size in bits of the largest contiguous object is @max(ulen) * @bitSizeOf(ucell).

So the complete list of ints would be:

  • uX/iX: explicit number of bits
  • ualu/ialu: Largest int that the ALU can take as input in one operation
  • udata/idata: Largest int that can be loaded/stored in a single access
  • ucell/icell: Smallest int that can be loaded/stored
  • uptr/iptr: Int sufficient to encode any pointer
  • ulen/ilen: Int sufficient to encode any contiguous object size
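
The size formula from the paragraph before this list can be sketched as follows. `ULen` and `UCell` are hypothetical aliases chosen for a byte-addressed 16-bit target, and `std.math.maxInt` stands in for the `@max(ulen)` spelling used in the comment, which is an assumption about what was meant.

```zig
const std = @import("std");

// Hypothetical aliases for the proposed ulen and ucell on a byte-addressed
// target with 16-bit object sizes; neither name exists in Zig.
const ULen = u16;
const UCell = u8;

// Size in bits of the largest contiguous object: max length times cell width.
const max_object_bits = std.math.maxInt(ULen) * @bitSizeOf(UCell);

test "largest contiguous object on the assumed target" {
    try std.testing.expect(max_object_bits == 65535 * 8);
}
```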

While we're at it, let's change align() to take bit values rather than byte values, so we can remove all assumption of byte-addressing from the language.

@SpexGuy
Contributor

SpexGuy commented Jan 5, 2021

Re: ualu et al, It's not difficult to put register sizes in a platform header if you are compiling for a target where that's important. IMO the language shouldn't though. We decided not to do the C thing of changing our integer size between targets because it causes robustness problems -- suddenly your calculations may overflow where they didn't before. For this reason, a "register sized integer" would generally be unsafe to use outside of very platform-specific code. usize (and maybe upointer) is an important exception to this rule, because the language itself needs ways to talk about pointer-sized integers and object sizes that are guaranteed not to overflow (for these purposes) and are usable at runtime. This is important for correctness. The language does not need to talk about register sized objects though, because registers are not part of the zig abstract machine. What's the size of a ualu or udata if compiling to JVM bytecode, for example?

In practice, for some targets, it will be important for performance to use register-sized integers. But there are plenty of tools for tracking down performance problems, and Zig makes it easy to modify code in those locations to use faster integers. Tracking down a rare integer overflow that only happens in production on one platform is far more difficult.

There might still be an argument for ucell, because the language does have the concept of addressable units. But that seems like it belongs in a separate proposal.

@ghost

ghost commented Jan 5, 2021

> There might still be an argument for ucell, because the language does have the concept of addressable units. But that seems like it belongs in a separate proposal.

As you command: #7693

@kyle-github

C has become much more explicit about address handling lately. If you have two objects and try to compare their pointers, it is not guaranteed that you will get the result you would expect from a single, uniform address space. IIRC, it is implementation-defined or maybe even UB.

Addresses on many platforms do not behave like integers. They either have very specific wrapping behavior (x86-64 with its requirements on the upper address bits), or have non-unique representations (16-bit DOS with segments and offsets) etc. Here are a few examples:

  • x86-64. The hardware requires that all upper address bits above a certain limit are all ones or zeros. It traps if you screw up.
  • x86-64. Solaris uses both halves of the address space, so addresses can be negative.
  • Aarch64. You can set up the CPU to mask off the upper 8 bits when generating addresses. That means that addresses that do not compare as equal can point to the same object. The addresses might not even have the same sign if the top bit is set and you treat them as signed integers.
  • 16-bit x86. Segments and offsets. Addresses are constructed by (segment << 4) + offset so they are definitely not unique. You can easily have more than one pointer (FAR pointers) that points to the same object in memory but has a different bit pattern.
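
Two of the cases above can be sketched in a few lines. This is an illustration under stated assumptions, not how any compiler does it: `linearAddress` and `stripTopByte` are hypothetical helper names, the 8086 formula is the one given in the last bullet, and the AArch64 mask simply clears the top 8 bits, which matches top-byte-ignore only for addresses in the lower half of the address space.

```zig
const std = @import("std");

// Real-mode 8086: linear address = (segment << 4) + offset.
fn linearAddress(segment: u16, offset: u16) u32 {
    return (@as(u32, segment) << 4) + @as(u32, offset);
}

// AArch64 with the top byte ignored: clear bits 63:56 before comparing.
fn stripTopByte(addr: u64) u64 {
    return addr & 0x00FF_FFFF_FFFF_FFFF;
}

test "different bit patterns, same memory location" {
    // 1234:0010 and 1235:0000 both name linear address 0x12350.
    try std.testing.expect(linearAddress(0x1234, 0x0010) == 0x12350);
    try std.testing.expect(linearAddress(0x1235, 0x0000) == 0x12350);

    // Two 64-bit values that differ as integers but address the same byte.
    const tagged: u64 = 0xAB00_0000_1234_5678;
    const plain: u64 = 0x0000_0000_1234_5678;
    try std.testing.expect(tagged != plain);
    try std.testing.expect(stripTopByte(tagged) == stripTopByte(plain));
}
```

In both cases a plain integer comparison of the raw values gives the wrong answer, and a normalization step is needed first.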

Pointer arithmetic is hard to get right.

@floopfloopfloopfloopfloop's proposal comes the closest on this, IMHO. Personally I think that usize is a footgun. But then I have dealt with older platforms and code that has to be very portable before.

Sorry for the rant ☹️

@ghost

ghost commented Jan 5, 2021

Specifically re. A64, those top bits aren't wasted -- they're used for pointer verification on newer versions. I think this is actually a plus: you could only get a pointer which is invalid in this way by casting a random integer, or going wayyy out of bounds, and the machine should yell at you if you do that.

Re. pointer comparison, if you want to compare pointers you can compare pointers -- IMO the compiler should be aware of memory segmentation and know how to check if pointers are really equal. No need to drag integers into it.

@kyle-github

> Specifically re. A64, those top bits aren't wasted -- they're used for pointer verification on newer versions.

Oh, they are not wasted! The original plan was to use those for tagged pointers as is often done in Smalltalk, Lua etc. More recently they are also used for memory check tags or whatever they are calling it these days.

By contrast AMD decided to go the other way and make it more painful to do anything with "unused" address bits. I think the jury is still out about which one of these options was the better idea.

> Re. pointer comparison, if you want to compare pointers you can compare pointers -- IMO the compiler should be aware of memory segmentation and know how to check if pointers are really equal. No need to drag integers into it.

Think about walking a linked list looking for an element. Every single time you move to a new node, you need to change your new pointer into normalized form for comparison. Whether that is fabricating a 32-bit pointer from the segment and offset on a 16-bit x86 CPU or masking off bits on Arm CPUs, you'll need to do that.

As I just recently caught up to the discussion on Discord where @MasterQ32 gave a very good example of the AVR series and the "joys" of asymmetric Harvard architectures, I tried to expand on the ideas he had: I wrote up more in a comment in issue #653.

@ghost

ghost commented Jan 6, 2021

If we really don't want the top bits to be available, we can simply define usize to leave them out, i.e. it would be u52 on A64 rather than u64.

Perhaps I wasn't clear: I think that pointer comparison should not be integer comparison of pointers; Zig has a stronger type system than C, we can distinguish pointers from integers. So, when comparing pointers, the compiler will know the platform and how to convert pointers into normal form or mask them off to compare them -- the user won't have to do that manually. This may mean that equality comparison becomes multiple machine instructions, but we're not aiming to be a macro assembler.

@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 23, 2021
@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@ryanschneider
Contributor

> Is there any defense for isize and ipointer? The only time I can imagine a negative value in those contexts is to check for overflow, or something esoteric. The former can be done with existing operators. The latter could be done with an explicitly safe cast.

C functions that return pointers often use "negative pointers" for error conditions, for example signal(2) returns SIG_ERR (-1) to signal an error. Not sure if this justifies the inclusion by itself but figured it was worth mentioning.

@oldwo

oldwo commented Apr 3, 2023

If Zig aims to support diverse platforms, including some yet-unknown future platforms, no false promise should be made that all pointers are equal and can be converted back-and-forth to an integer. usize is exactly such a promise. I strongly favor the idea by @floopfloopfloopfloopfloop that platform headers should provide some function like
const my_usize=@PointerSize(pointertype)
And then you use that my_usize where you need it. There might also be functions to convert a pointer to an integer and back, but not all platforms are required to implement it.

@ethindp

ethindp commented Jul 8, 2023

Slicing off the bits that we can ignore (e.g. bits 48:64 on x64) is a bad idea. Primarily because the "size" of a memory address can change at runtime. For example, if you've enabled paging but haven't set CR4.LA57, your address size is 48 bits (bits 48:64 are either zero or one). But if you set CR4.LA57, suddenly your address size changes to 57 bits (if I remember right) (bits 57:64 are zero or one). Or, even worse, the upper bits of an address can have meaning! For example, again on x86_64, if you enable protection keys, suddenly the upper 4 bits or so are the protection key for either user or supervisor mode. I don't really see how a strong type guarantee for usize/isize could be possible given that all of this can change just by flipping a few bits in a control register, and x64 isn't the only architecture that does this (RISC-V and ARM/Aarch64 do this, I'm pretty sure). You can't add probing of this into the language as a runtime feature, either, since, at least on x86 or x64, attempting to access a control register outside supervisor mode is illegal.
Update: the above issue can be (sort-of) mitigated, but not really. On x86 you can check CPUID for 5-level paging enablement/57-bit addresses, and you can figure out the address width using a particular CPUID leaf/subleaf combination, but this is more of a hack than anything else. And other architectures may not provide an equivalent feature.
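
A minimal sketch of why the usable address width cannot be baked into a type, assuming the usual x86-64 canonical-address rule (all bits from the top implemented bit upward must be equal). `isCanonical` is a hypothetical helper, and the widths 48 and 57 are the two paging modes mentioned above.

```zig
const std = @import("std");

// Hypothetical helper: an address is canonical for a given virtual-address
// width if bits (va_bits - 1) through 63 are all zeros or all ones.
fn isCanonical(addr: u64, comptime va_bits: comptime_int) bool {
    const upper_mask: u64 = ~@as(u64, 0) << (va_bits - 1);
    const upper = addr & upper_mask;
    return upper == 0 or upper == upper_mask;
}

test "the same value is valid or invalid depending on the paging mode" {
    const addr: u64 = 0x0000_8000_0000_0000; // only bit 47 set
    try std.testing.expect(!isCanonical(addr, 48));
    try std.testing.expect(isCanonical(addr, 57));
}
```

The width is a compile-time parameter here only to keep the sketch simple. On real hardware it is a property of the running system, which is the point being made above.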

@perillo
Contributor

perillo commented Jan 23, 2024

The C language also has the ptrdiff_t type, the type of the difference between two pointers.

See https://stackoverflow.com/questions/1464174/size-t-vs-uintptr-t for more details, especially the answer by Alex Martelli: https://stackoverflow.com/a/1464194.

If #1738 is approved, ptrdiff should be used.


13 participants