more pointer metadata: address spaces #653
Comments
how about:
const default = 0;
const mapper_hardware_ram = 1;
var foo: u32 align(4) addrspace(mapper_hardware_ram) = 10;
why do we want an addrspace type? seems like it will cause more problems than it will solve. like, in your example, if we inline |
Nim has it under the name "Memory regions" ( https://nim-lang.org/docs/manual.html#types-memory-regions ). AFAIK it is pretty useless, but Nim has only ever managed to drop a feature once. |
I believe the Nim "Memory regions" are a language-level feature used for extra safety when dealing with pointers to possibly semantically different memory areas such as heap, stack, objects from another language and so on. As such they only exist in the source, not in the compiled code. The gcc and llvm address spaces, however, are backend-specific information required for correct codegen when code and data live in different memories. |
@Ilariel: you are right. |
A lot of embedded systems use various address spaces. These are usually extremely proprietary and are totally non-portable in C. It would be nice to have some sort of structure in Zig to handle these cases. It would make Zig a LOT nicer than C in this regard. Examples of address spaces:
|
I accepted this, but I'm going to be sure to feel the need for it in my OS project before implementing it. |
This could be used for TLS: (and %gs is unused, so we could use it for per-cpu data (why might we need this?)) |
I would like to voice my support for this feature. I have been experimenting with Zig's AVR support recently, and have experienced some friction due to AVR's separate program and data address spaces. |
There's potential for this feature to be a bit more powerful than distinct types for pointers. In #4284 (comment), @vegecode describes a use case in embedded where unaligned access is valid in some memory regions but invalid in others. His solution in C was to use volatile on pointers to the aligned-only memory region to prevent the compiler from optimizing his aligned loads into larger unaligned ones. But if the address space could convey information to the optimizer about whether unaligned access is valid it might provide a more targeted solution to this problem. |
I'm not so sure about that, since the addrspace info is more than just a way to distinguish pointers; it is relevant (and important) to the backend. For example, LLVM's AVR backend will recognize copies from program memory to data memory, and insert the special instruction required to do so. Distinct types would prevent an addrspace(1) pointer from being passed to a function that expects an addrspace(0) pointer, but it doesn't seem like that issue allows for that information to be passed to LLVM. |
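To make the distinction above concrete, here is a minimal host-side C sketch of the "distinct types" approach. It simulates two memories that both start at address 0, as on a Harvard-architecture MCU. Every name in it (`program_memory`, `progmem_ptr`, `load_progmem` and so on) is hypothetical, and the load helpers are plain array reads standing in for the special instruction a real backend would emit. The wrapper types stop the two kinds of pointer from being mixed, but nothing here reaches a code generator, which is the limitation the comment above points out.

```c
#include <assert.h>
#include <stdint.h>

/* Two simulated memories that both start at address 0. */
static const uint8_t program_memory[4] = { 0xAA, 0xBB, 0xCC, 0xDD }; /* "flash" */
static uint8_t data_memory[4];                                       /* "RAM"   */

/* Distinct wrapper types: passing one where the other is expected
 * is a compile error, even though both hold a 16-bit address. */
typedef struct { uint16_t addr; } progmem_ptr;
typedef struct { uint16_t addr; } datamem_ptr;

/* Stand-in for a special program-memory load. */
static uint8_t load_progmem(progmem_ptr p)
{
    assert(p.addr < sizeof program_memory);
    return program_memory[p.addr];
}

/* Stand-ins for ordinary data-space load and store. */
static uint8_t load_data(datamem_ptr p)
{
    assert(p.addr < sizeof data_memory);
    return data_memory[p.addr];
}

static void store_data(datamem_ptr p, uint8_t value)
{
    assert(p.addr < sizeof data_memory);
    data_memory[p.addr] = value;
}

/* Copy across the two spaces; each side uses its own access routine. */
static void copy_prog_to_data(datamem_ptr dst, progmem_ptr src, uint16_t len)
{
    for (uint16_t i = 0; i < len; i++) {
        progmem_ptr s = { (uint16_t)(src.addr + i) };
        datamem_ptr d = { (uint16_t)(dst.addr + i) };
        store_data(d, load_progmem(s));
    }
}
```

Note that address 1 names a different byte in each memory, so comparing the raw `addr` fields of a `progmem_ptr` and a `datamem_ptr` says nothing about whether they refer to the same object.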
Are we sure we shouldn't express address spaces with enum variants? That seems like a better option than ints. We could reserve an enum name, and allow the root source file to implement that enum with whatever variants it needed:
// 'Segment' is a reserved name, like 'main' or 'panic'
const Segment = enum {
    rom,
    flash,
    write_only,
    ram_fast,
    ram_slow,
    // etc.
};
Then segment pointers would accept enum variants, like |
Another example where this would be useful is the upcoming multi-memory proposal for WebAssembly. |
Here's a thought experiment as a driver. I came up with this from @MasterQ32's example of AVR on Discord and my own experience with ugly MCUs. This came out of the discussion in #5185.
One approach to supporting this in Zig would be to do something like this:
For x86-64, you would have two address spaces, one for code and one for data. They would have almost identical definitions and would allow comparison of pointers between each other. You could have a per-platform setting that designates default address spaces for pointers that are not given an explicit address space. Or that could be a bool in the address space definitions for the platform on each address space, something like I am sure that there are more features that I am missing (such as write blocks for Flash). Then have Zig enforce the following:
|
Surely, if we're considering fixing the roles of different address space numbers, we should make Re. banks, this may be considered an orthogonal concept, declared separately from address space: |
Banks make absolutely no sense. They can always be combined into a linear map of a single address space, perhaps with peculiar alignment rules. As we already have to handle alignment rules, this is nothing new. |
@shawnl not sure what you are saying here. I was thinking of situations where banks are used for overlays so the address as seen by the CPU of data in one bank is identical to the address in another. I think in general that trying to tackle that along with all the rest is biting off too much, though! I'll edit out the bank stuff. |
@EleanorNB thanks for the thought on this. I very much like the idea you have with the different AddressSpace subtypes in an enum. Perhaps I misunderstood part of your proposal, but
Agreed on banks. I was getting too complicated and those are not address spaces. Both you and @shawnl pointed that out. I edited out the bank stuff from the example and from the rest of the proposal.
One of the goals of this is to make simpler platforms like Aarch64 and x86-64 require no address space annotations at all. Existing code should work without change and there should be no surprises. My internal question was, "how can this be retrofitted into what exists today without breakage but still provide reasonable support for real hardware?" Only when you want to program AVR-based systems or other weird platforms would you need to care and need to annotate/type your pointers. That's fine because that code is deeply platform specific anyway. It would allow you to check your own code and make sure that Zig (and LLVM) have sufficient data to generate correct code for the platform.
You do want Zig to catch it when you try to compare a pointer to data in RAM against a pointer to data in Flash if the address spaces are such that they can overlap (and thus are marked as not comparable). Or when you try to assign a pointer from a function in one space to a pointer to a function in another space that are not compatible. You do want LLVM to generate the correct code to access a function in ROM if that is different from how to access it in Flash.
There are a number of different possible ways to record the address space info. For instance, rather than having separate address space entries for code and data for a single physical medium, perhaps a small enum field with three possible values |
Two useful OS justifications for this: The Linux kernel (and possibly others) use This is because if a user program passes a kernel pointer in a system call (say for example read/write/mmap), the kernel must catch that instead of accidentally dereferencing it in kernel space, which would allow the user to access kernel memory. Linux uses these macros:
# define __user __attribute__((noderef, address_space(1)))
# define __kernel __attribute__((address_space(0)))
# define __iomem __attribute__((noderef, address_space(2)))
Used like this:
Having support for this in Zig would make this safe, statically verifying there are no unverified or accidental user/kernel boundary memory accesses that could lead to nasty security and correctness bugs.
@andrewrk mentioned "feel[ing] the need for it in my OS project before implementing it" -- I think this is a great use case for it, and something Linux uses extensively in kernel and driver (especially important since it may be third party modules!) code. |
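For readers who have not seen the pattern, here is a minimal host-side C sketch of how such annotations are typically consumed. It is an illustration under stated assumptions, not kernel code: `fake_user_memory` and `demo_copy_from_user` are hypothetical names, the "user" half of the address space is just a static array, and in this sketch the macros expand to nothing unless the file is run through a checker that defines `__CHECKER__` (as sparse does). The point is that a `__user` pointer is never dereferenced directly; it is only handed to one routine that validates the range and then forces the cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Real attributes under a checker such as sparse; no-ops otherwise
 * (an assumption of this sketch so it builds with any C compiler). */
#ifdef __CHECKER__
# define __user  __attribute__((noderef, address_space(1)))
# define __force __attribute__((force))
#else
# define __user
# define __force
#endif

/* Hypothetical stand-in for the user-accessible part of memory. */
static unsigned char fake_user_memory[64];

/* Hypothetical analogue of copy_from_user: validates that [src, src+len)
 * lies inside the "user" region, then copies. Returns the number of
 * bytes NOT copied, so 0 means success. */
static size_t demo_copy_from_user(void *dst, const void __user *src, size_t len)
{
    uintptr_t start = (uintptr_t)(__force const void *)src;
    uintptr_t base = (uintptr_t)fake_user_memory;

    assert(dst != NULL);
    if (len > sizeof fake_user_memory)
        return len;                       /* cannot possibly fit */
    if (start < base || start - base > sizeof fake_user_memory - len)
        return len;                       /* not a user-region range */

    /* The only place where the annotation is deliberately dropped. */
    memcpy(dst, (__force const void *)src, len);
    return 0;
}
```

A caller that passes a pointer to ordinary "kernel" data gets `len` back and nothing is copied, which is the runtime half of the protection; the static half (rejecting a direct `*src`) only exists when a checker understands the attributes.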
I also ran into the need for this when playing around with Zig and LLVM's AMDGPU target. |
Thanks for the example use cases @BinaryWarlock and @Snektron. The key things here are that you want to make sure that you cannot accidentally mix pointers from one domain (user) into another (kernel) or between mappings as seen in one area (main CPU) vs. other hardware (GPU), right? This may be a bit of an intersection between some ideas I put in this issue and in #7693. You need to be able to control things like comparison and assignment. You also need to be able to implement translation between address spaces (user to kernel for instance). For that the bag-o-bits type would be useful as it has no interpretation. Take @BinaryWarlock's example. If you are in kernel space, you want the constraints on a user pointer to prevent dereferencing. @Snektron can you give an example of what exact problem you hit? |
@kyle-github Correct, they're essentially separate namespaces for pointers that you cannot mix. I was assuming the translation would happen by I don't have a particular preference on how it's implemented (as long as the semantics allow for that -- not being able to dereference/mix pointer address spaces), but I'm sure it could be generalized with general pointer metadata/tagging or something. |
One of the offshoots of the bag-o-bits ideas that are floating around would be to take the user pointer, assign it to a bag-o-bits type removing all useful typing. Then do the appropriate hardware/software lookup to translate a user space pointer into kernel space (IIRC due to things like PAE and uneven kernel/user address space splits this can get really funky involving PTE lookups, but it has been a long time since I trawled through kernel code). Then you would |
Should If there is no problem with moving |
GPUs have a few different types of memory, with different purposes. For example there is general-purpose global memory, but there is often also a per-shader (private) core and per-compute unit (consisting typically of a group of 32 or 64 shader cores) (local) memory which is smaller and a lot faster. The OpenCL programming model extends and generalizes this to include at least 6 different address spaces, and LLVM's AMDGPU backend uses this model as well. From the LLVM AMDGPU backend docs for example:
Note that the flat address space can be used for global, local and private data alike, but this is not supported on every target machine, and requires manual setup. Also note that while the global and constant address spaces in fact refer to the same virtual memory addresses, values specified to lie in the constant address space are assumed not to change for the entire duration of the kernel (as opposed to C's 'constant' pointers which are not really constant), which could improve efficiency. This also highlights a possible intersection with issue #5185: Address spaces could be indexed with different size pointers, and the actual size of |
I have thought a bit about this, and I think I have a decent concrete idea. To summarize: the core idea is that variables may be placed in different address spaces, which are architecture-specific. Loading values from these address spaces may require different instructions depending on the address space, and so pointers are required to know in which of the available address spaces the value lies.
Semantics
Every variable and pointer will gain a mandatory address space attribute, which may be inferred from the context of the variable declaration or pointee. For example, depending on architecture (specifically: SPIR-V), pointers to locals, globals and parameters may be of different address spaces. If a variable is declared inside a function, its address space should be inferred as the default address space for function locals (in the case of SPIR-V:
Casting
Casting between address spaces is typically dangerous behavior, and so I argue that this should only be allowed through the nuclear option of
Syntax
Pointers and global variable declarations will gain another optional attribute, with syntax
const progmem_i32: i32 addrspace(.progmem) = 10;
const progmem_i32_ptr: *addrspace(.progmem) const i32 = &progmem_i32;
C interop
C compilers like clang and GCC support address spaces as a compiler-specific attribute. For example, gcc and clang accept the syntax
User-defined address spaces
As pointed out by @BinaryWarlock here, the Linux kernel uses this to prevent accidental dereferencing of user-space pointers. I believe that we should not support this case, as it leads to much additional complexity. For example, what if a user-defined address space is required for the non-default address space? I believe that this use case should be handled by opaque types. For example:
fn UserPtr(comptime Child: type) type {
    return *opaque {
        // Opaque values cannot be passed by value, so take a pointer.
        fn deref(self: *@This()) Child {
            return ...;
        }
    };
}
Unresolved
Additional notes
Thread local variables are typically handled by an address space internally. For example, on x86 thread locals are typically implemented using one of the segment registers. While I don't think that it's very ergonomic to use |
I did some digging into what's up with the other LLVM back ends that have address spaces:
|
GCC has the concept of address spaces, which is useful for embedded programming: https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html
In LLVM pointer types have the concept of the "address space" of the pointer: http://llvm.org/docs/LangRef.html#pointer-type
This is easy to support. Just like alignment, global variables can specify the address space that they are in. Just like alignment, pointers can specify their address space, and if it's the default address space, it can be omitted.
We will use a simple integer for the address space. 0 is the default. If an application wants to have a name for an address space, it can assign an integer to a constant. If an application wants to coordinate address spaces with a package it depends on, the package should accept a configuration option to specify the integer mapping for a given address space name, and then both the application's constants and the package's constants will refer to the same integer.
new keyword: addrspace
It can be used to create an address space constant:
It can be used in the pointer syntax:
&addrspace(mapper_hardware_ram) u32
The type of mapper_hardware_ram is addrspace. The only thing you can do with it is use it in pointer syntax and global variable syntax.
Global variable:
Implicit casting and explicit casting do not allow changing the address space of a pointer.
However you can use @addrspaceCast(addr_space, ptr) to (unsafely) override the address space of something.