Add large code model information. #388

kuanlinchentw · 2023-06-16T06:01:21Z

Hi,

This PR adds a description of the large code model.
I was wondering whether we need a large+fpic model.
In general, the position-independent code model puts external symbol addresses into the GOT.
Is there any case where we have to lay out the GOT more than ±2 GiB away from the code?

riscv-elf.adoc

rui314 · 2023-09-27T04:57:18Z

I think I'd prefer to define a set of relocations to materialize a 64-bit address with four instructions and let the linker relax it to one to three instructions depending on the offset to the materialized address. That approach is easier to implement than the address pool and doesn't need a writable text segment.

I'd also think it could be faster than reading addresses from the address pool because 1) the processor could fuse 3 or 4 instructions into a single macro-op, and 2) loading an address from the address pool is just a waste of resources if the materialized address happens to be not too far from PC.

kito-cheng · 2023-09-27T09:24:29Z

@rui314 I am not sure we can generate an arbitrary 64-bit address within 4 instructions. Would you mind sharing the instruction sequence?

rui314 · 2023-09-27T10:17:21Z

@kito-cheng Apologies, we can't materialize a 64-bit value with four instructions in RISC-V. We actually need six instructions to, for example, load a value from an arbitrary 64-bit address as follows:

lui   t0, <highest20>
addi  t0, t0, <higher12>
slli  t0, t0, 32          # move the upper bits into place
auipc t1, <hi20>
add   t1, t1, t0          # register-register add, not addi
ld    t1, <lo12>(t1)

which can be relaxed to the following 5 instructions if the symbol is within ±2^44 bytes

addi    t0, zero, <higher12>
c.slli  t0, 32
auipc   t1, <hi20>
add     t1, t1, t0
ld      t1, <lo12>(t1)

and of course to the following two instructions if it's within ±2GiB.

auipc   t1, <hi20>
ld      t1, <lo12>(t1)

It looks to me that the RISC-V psABI's design choice to allow the linker to shrink the section really shines for this use case.
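
To make the tiers above concrete, here is a minimal Python sketch of the arithmetic involved. The function names are hypothetical, and the range bounds are simply the ones quoted in the comment above (±2 GiB and ±2^44 bytes), not values taken from any linker. The hi20/lo12 split uses the usual rounding so that the low part fits a signed 12-bit immediate.

```python
def split_pcrel(offset):
    """Split a PC-relative offset into an auipc hi20 part and a signed lo12 part."""
    hi20 = (offset + 0x800) >> 12   # round so the remainder fits in [-2048, 2047]
    lo12 = offset - (hi20 << 12)    # signed 12-bit remainder for the load
    return hi20, lo12


def relaxed_length(offset):
    """Instruction count after relaxation, using the bounds quoted in the thread."""
    if -(1 << 31) <= offset < (1 << 31):
        return 2                    # auipc + ld
    if -(1 << 44) <= offset < (1 << 44):
        return 5                    # addi + c.slli + auipc + add + ld
    return 6                        # full lui-based sequence
```

For example, `split_pcrel(0x12345FFF)` rounds the high part up and yields a low part of -1, which is why the last instruction must know which `auipc` it pairs with.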

jrtc27 · 2023-09-27T12:34:52Z

Creating new ABIs that only support position-dependent code seems like a bit of a questionable thing to be doing in this day and age

kuanlinchentw · 2023-09-28T00:49:49Z

I think using a constant pool for the large model doesn't cost that much, because the compiler can use anchors to tag variables and load each variable by its offset from the anchor.
For example, if we want to get the values of global variables a and b, we don't have to load a constant pool entry twice, once for a and once for b.

auipc t0, hi20(.LC0)
ld    t1, lo12(.LC0)(t0)    # one pool load fetches the anchor address
lw    a4, 0(t1)             # a, at offset 0 from the anchor
lw    a0, 4(t1)             # b, at offset 4 from the anchor
.LC0:
       .dword  .LANCHOR0
       .bss
       .set    .LANCHOR0,. + 0
a:
       .zero   4
b:
       .zero   4
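
The same idea can be modelled in a few lines of Python. This is only an illustrative sketch: the flat `memory` buffer, the addresses `POOL_LC0` and `ANCHOR0`, and the sample values 7 and 9 are all assumptions made up for the example. It shows one pool load serving two variable accesses.

```python
import struct

# Hypothetical flat memory holding one pool entry and the anchored .bss block.
memory = bytearray(64)
POOL_LC0 = 0     # assumed address of the .LC0 pool entry
ANCHOR0 = 32     # assumed address of .LANCHOR0, where a and b live

struct.pack_into("<Q", memory, POOL_LC0, ANCHOR0)   # .LC0: .dword .LANCHOR0
struct.pack_into("<ii", memory, ANCHOR0, 7, 9)      # sample values for a and b


def load_a_and_b():
    """Read a and b with a single constant-pool load."""
    (anchor,) = struct.unpack_from("<Q", memory, POOL_LC0)   # ld t1, lo12(.LC0)(t0)
    (a,) = struct.unpack_from("<i", memory, anchor + 0)      # lw a4, 0(t1)
    (b,) = struct.unpack_from("<i", memory, anchor + 4)      # lw a0, 4(t1)
    return a, b
```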

rui314 · 2023-09-28T05:10:47Z

I know there are many extremely large programs out there that might already need the large code model, but to my knowledge, most of these programs are server-side and run in datacenters. They naturally need to be built as position-independent executables, and their text segments need to be read-only (or execute-only if possible). This made me wonder about your motivation to define a position-dependent-only ABI in the first place.

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

MaskRay · 2023-09-28T05:25:39Z

I have some notes about large code models in aarch64/powerpc64/x86-64: https://maskray.me/blog/2023-05-14-relocation-overflow-and-code-models#aarch64-code-models

I know that certain JIT programs may use large code models, possibly just the position-dependent form.

I think using a constant pool for the large model doesn't cost that much, because the compiler can use anchors to tag variables and load each variable by its offset from the anchor.

Agree.

For server side large x86-64 applications, they can use the medium code model. This larger range makes it unlikely for AArch64 to encounter relocation overflow issues before the binary becomes excessively oversized for x86-64.

aswaterman · 2023-09-28T05:40:06Z

to my knowledge, most of these programs are server-side and run in datacenters. They naturally need to be built as position-independent executables, and their text segments need to be read-only

Without commenting on the merits of this particular code model, I'll remark that there is a distinct and very real use case: RV64 embedded systems, which might not consume that much memory in total but need to cope with a sparse address space. The text/rodata might be separated by gigabytes from the absolute-addressed I/O, and there might be multiple regions of each. There's no virtual memory, so it isn't possible to remap the relevant regions to improve virtual spatial locality.

kuanlinchentw · 2023-10-02T03:34:46Z

Actually, using constant pools for the large code model can still produce position-independent executables. It only needs the static linker to leave dynamic relocations so that the loader or the memory manager can add the offset when the executable is remapped.
In my first comment, I was just wondering whether there is a real case where we need large+fpic.
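
A rough Python sketch of that load-time fix-up follows. The function name and the list-based pool are hypothetical; the behaviour is modelled on how a relative dynamic relocation (such as R_RISCV_RELATIVE) adds the load bias to a stored address, and it leaves entries without a relocation untouched.

```python
def apply_relative_relocs(pool, reloc_indices, load_bias):
    """Return a copy of the pool with the load bias added to relocated entries."""
    return [
        value + load_bias if index in reloc_indices else value
        for index, value in enumerate(pool)
    ]
```

Entries that hold plain constants rather than addresses carry no relocation, so they keep their link-time value after the executable is moved.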

jrtc27 · 2023-10-02T03:43:23Z

Yes, constant pools are equivalent to a hand-rolled GOT.

kuanlinchentw · 2023-10-02T07:45:48Z

Yes, constant pools are equivalent to a hand-rolled GOT.

Yes. It's a nice description. Thanks.

rui314 · 2023-10-02T08:09:22Z

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

kito-cheng · 2023-10-02T10:39:38Z

lui   t0, <highest20>
addi  t0, t0, <higher12>
slli  t0, t0, 32
auipc t1, <hi20>
add   t1, t1, t0
ld    t1, <lo12>(t1)

Can we use lui rather than auipc? I think using lui throughout would make it easier to share the high part (the first 5 instructions), which should let the compiler share the high part between different low parts.

If we use auipc, we must either require that the whole instruction sequence stay together or have a relocation that lets the last instruction point to the auipc instruction, like PCREL_LO12_*.

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

I was involved in the design and implementation of this code model when I was still a colleague of @kuanlinchentw, so I guess I can give a few details from memory. The design comes with several advantages: 1) it is simple to implement, because the implementation can be borrowed from AArch64 :P, and 2) NO new relocation is required.

However, the disadvantages are obvious: 1) every address needs to be loaded from the constant pool, and 2) the pool has duplicated entries.
But we think the disadvantages can be ignored in most use cases of the large code model, since it is mostly used in MMU-less situations, and we also have the ePIC proposal, which could address some special use cases in the embedded world.

IIRC, the long instruction sequence scheme has also been discussed before somewhere (publicly?), but it comes with more implementation overhead: new relocations and new linker relaxation. Also, the psABI TG didn't exist at that time, so we were trying to avoid touching the psABI as much as possible.

kuanlinchentw · 2023-10-03T02:21:05Z

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

As @kito-cheng mentioned, it's easy to implement from the compiler's point of view, and it doesn't need any binutils changes.
For the compiler, each variable access can be a dependent load instruction after setting the anchor value.
This avoids using lots of pseudo instructions that may not be scheduled apart.
We may have considered using a set of relocations to materialize a 64-bit address before.
But there is a trade-off between the compiler scheduler and linker relaxation.
If the compiler expands the instruction sequence in order to schedule it, it's hard for the linker to relax.
Even if the linker can recognize the sequence and relax it, the deleted instructions may affect the scheduling result.
As for the disadvantages @kito-cheng mentioned, I think they are still an issue.
Obviously, it wastes space on redundant entries. Maybe the compiler can generate mergeable constant sections to reduce the harm.
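
As a simplified illustration of what merging would buy, the Python sketch below deduplicates pool entries across several objects. The function name and the list-of-lists input are hypothetical, and the model compares final addresses directly, which glosses over how a real linker handles relocated entries in mergeable (SHF_MERGE) sections.

```python
def merge_pool_entries(pools):
    """Merge per-object constant pools, keeping one slot per distinct address."""
    merged = []      # deduplicated pool, in first-seen order
    slot_of = {}     # address -> index of its slot in the merged pool
    for pool in pools:
        for addr in pool:
            if addr not in slot_of:
                slot_of[addr] = len(merged)
                merged.append(addr)
    return merged, slot_of
```

With two objects that both reference the same anchor, the merged pool holds that address once and both objects load from the same slot.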

rui314 · 2023-10-03T05:43:41Z

If no new feature is required for it, what's the point of adding a new section to the psABI document for it? Does the AArch64 psABI have a section for its counterpart?

kuanlinchentw · 2023-10-03T06:03:25Z

If no new feature is required for it, what's the point of adding a new section to the psABI document for it? Does the AArch64 psABI have a section for its counterpart?

It needs a new code model option, just like medany, with different code generation.
Yes. AArch64 defines the small, kernel, medium and large models, and there is a section about code models.

rui314 · 2023-10-03T06:18:43Z

I couldn't find a section in https://github.com/ARM-software/abi-aa/blob/844a79fd4c77252a11342709e3b27b2c9f590cf1/aaelf64/aaelf64.rst about how to use a constant pool to load an object's address from memory. Could you share the URL?

kuanlinchentw · 2023-10-03T06:25:06Z

I couldn't find a section in https://github.com/ARM-software/abi-aa/blob/844a79fd4c77252a11342709e3b27b2c9f590cf1/aaelf64/aaelf64.rst about how to use a constant pool to load an object's address from memory. Could you share the URL?

https://github.com/ARM-software/abi-aa/blob/2982a9f3b512a5bfdc9e3fea5d3b298f9165c36b/sysvabi64/sysvabi64.rst#code-models

rui314 · 2023-10-03T06:39:20Z

And which code model? It looks like the "large" code model in the AArch64 psABI is different from this proposal because the AArch64's large code model requires that GOT is within 2 GiB from the text segment and seems like addresses are read from GOT.

kuanlinchentw · 2023-10-03T06:59:22Z

And which code model? It looks like the "large" code model in the AArch64 psABI is different from this proposal because the AArch64's large code model requires that GOT is within 2 GiB from the text segment and seems like addresses are read from GOT.

I think you can find example at https://github.com/ARM-software/abi-aa/blob/2982a9f3b512a5bfdc9e3fea5d3b298f9165c36b/sysvabi64/sysvabi64.rst#get-the-address-of-a-symbol-defined-in-the-same-elf-file

I think the distance to the GOT there refers to the literal pool, not the normal GOT, because it doesn't support PIC.

rui314 · 2023-10-03T07:02:24Z

If "GOT" in the documentation doesn't mean the .got section, that's super confusing, but if that's the case, that's their problem and not ours. Thank you for pointing that out.

riscv-elf.adoc

kito-cheng · 2023-12-21T08:59:10Z

Some comments from the last LLVM sync meeting:

The constant pool and the long instruction sequence each have their own use cases, so we may allow both schemes and let the user choose between them with an option; the same goes for function calls.

Some other comments from the last psABI call:

We haven't (officially) reserved intra-procedure-call scratch registers. AArch64 lists r16 and r17 as IP0 and IP1 and explicitly says they may be clobbered during a procedure call. That might be an issue when we implement range extension thunks.

However, we actually already use t0, t1, t2 and t3 in the PLT, so we could use the same set of registers for that, and then we should specify it explicitly in the psABI. The only concern is that it will look like an incompatible ABI change, but this is less risky since it's already de facto behavior.

jrtc27 · 2023-12-21T16:23:16Z

Some comments from the last LLVM sync meeting:

The constant pool and the long instruction sequence each have their own use cases, so we may allow both schemes and let the user choose between them with an option; the same goes for function calls.

Some other comments from the last psABI call:

We haven't (officially) reserved intra-procedure-call scratch registers. AArch64 lists r16 and r17 as IP0 and IP1 and explicitly says they may be clobbered during a procedure call. That might be an issue when we implement range extension thunks.

However, we actually already use t0, t1, t2 and t3 in the PLT, so we could use the same set of registers for that, and then we should specify it explicitly in the psABI. The only concern is that it will look like an incompatible ABI change, but this is less risky since it's already de facto behavior.

No; using custom calling conventions within an object has always been allowed (and that’s a thing that’s done across architectures), but range extension thunks clobbering registers that weren’t previously reserved for it would break that. It’s only safe to do in the PLT case because people know PLTs exist and they need to be careful.

kito-cheng · 2023-12-22T06:30:41Z

No; using custom calling conventions within an object has always been allowed (and that’s a thing that’s done across architectures), but range extension thunks clobbering registers that weren’t previously reserved for it would break that. It’s only safe to do in the PLT case because people know PLTs exist and they need to be careful.

Yeah, fair enough. So I think we should move forward without range extension thunks for now and extend the model later with the necessary changes (e.g. adding a new tag) if needed.

kito-cheng

I intend to move this forward and extend it later, e.g. by adding the long instruction sequence scheme. One concern is that it will require adding new relocations and extra implementation work, so it should be split into a separate step to prevent this from being stuck here too long.

For now, I think it would be great to add a note like: "NOTE: We intend to extend the large code model with different code generation strategies in the future." to mention that we will add the long instruction scheme in the future; range extension thunks may also be included in the future.

kito-cheng · 2024-01-30T07:56:44Z

ping @MaskRay @rui314, would you like to give your blessing to move this forward?

riscv-elf.adoc

sorear · 2024-02-18T19:42:15Z

My biggest concern here is that we're allocating the name "large" and creating a compatibility promise for a short-term code model. If in the future we have a fully designed large model, gcc won't be able to switch to it for -mcmodel=large because that will regress functionality for anyone with an old binutils, so the new, better code model will be stuck with a worse name.

@kito-cheng There is a fourth option - use a real GOT. RISC-V does not have a meaningful concept of a GOT base, so there's nothing forcing the GOT to be contiguous; interleave text and GOT in 4 GiB chunks to support GOTPCREL_HI20 relocations in the large model. Obviously this won't work if you're generating a.out and need RX and RW memory to be a single contiguous range each, but it should work for ELF.

I'm a strong supporter of range extension thunks and implemented them for the riscv Go linker a while ago. Ideally we would support them with both 4-byte and 8-byte call sites, which means we need a new relocation type JAL_THUNK anyway, so adding CALL_THUNK might not be so bad.
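
As a rough sketch of the decision a linker would make for the two call-site sizes mentioned above, consider the Python below. The function name and the string results are hypothetical, and the reach constants are approximations (a 4-byte `jal` reaches about ±1 MiB, an 8-byte `auipc`+`jalr` pair about ±2 GiB); the exact `auipc`+`jalr` bounds are shifted slightly by lo12 rounding, which this model ignores.

```python
JAL_REACH = 1 << 20           # jal: signed 21-bit offset, about ±1 MiB
AUIPC_JALR_REACH = 1 << 31    # auipc + jalr: about ±2 GiB (approximate)


def call_strategy(call_site, target, site_bytes):
    """Return "direct" if the call site can reach the target, else "thunk"."""
    offset = target - call_site
    reach = JAL_REACH if site_bytes == 4 else AUIPC_JALR_REACH
    if -reach <= offset < reach:
        return "direct"
    return "thunk"            # route the call through a nearby range extension thunk
```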

kivoimusa · 2024-02-21T08:12:10Z

I think I can post this here for some brief:
Am running Ubuntu 20.04 L.T.S on AMD 64-bit processor and I got a compiler error when executing my linked embedded python into caffe framework. The compiler tells me to recompile with -fPIC. This causes memory relocation and I don't know why the linker and the compiler are failing to use a linked static library.
My caffe build is a large code-base of more than 32GB. The program is built from source as per the manual. I have tried to look for solutions on Stack-overflow with no success.
I would really appreciate for your assistance.
Kivoi Musa

jrtc27 · 2024-02-21T08:17:43Z

I think I can post this here for some brief: Am running Ubuntu 20.04 L.T.S on AMD 64-bit processor and I got a compiler error when executing my linked embedded python into caffe framework. The compiler tells me to recompile with -fPIC. This causes memory relocation and I don't know why the linker and the compiler are failing to use a linked static library. My caffe build is a large code-base of more than 35GB. The program is built from source as per the manual. I have tried to look for solutions on Stack-overflow with no success. I would really appreciate for your assistance. Kivoi Musa

This is the specification for the RISC-V instruction set's ABI, and your 64-bit AMD processor is not a RISC-V processor; unless you're cross-compiling for RISC-V (doubtful?) you seem quite lost and this is not the place for this kind of question since it's for a completely different processor instruction set.

Implement large code model for GlobalAddressSDNode, BlockAddressSDNode and ExternalSymbolSDNode. See discussion on riscv-non-isa/riscv-elf-psabi-doc#388. co-authored by: Kuan-Lin Chen <rufus@andestech.com>

kito-cheng · 2024-04-25T08:50:30Z

@sorear

I'm inclined to accept the current proposal with optional range extension thunk*1 support. We already have a note saying we may add other code generation strategies, which leaves room to add more large code model variants in the future. I am not really comfortable with the multiple-GOT design: it's complicated, and it would be challenging to specify in customized linker scripts.

*1 Add a note mentioning that function calls may use the auipc+jalr sequence if the linker supports range extension thunks.

kito-cheng · 2024-07-17T06:49:28Z

I will move forward and merge this PR after the next psABI meeting. GCC support was merged a while ago, and LLVM has also provided a PoC.

kito-cheng

LGTM

Implement large code model for GlobalAddressSDNode and ExternalSymbolSDNode. See discussion on riscv-non-isa/riscv-elf-psabi-doc#388. --------- Co-authored-by: Kuan-Lin Chen <rufus@andestech.com>

With riscv-non-isa/riscv-elf-psabi-doc#388 landed it makes sense to have a define for the large code model for consistency with medany and medlow.


kito-cheng reviewed Jun 27, 2023

kito-cheng requested review from jrtc27 and kito-cheng June 27, 2023 02:39

kuanlinchentw force-pushed the master branch 2 times, most recently from 4c454df to a902324 Compare September 27, 2023 03:10

Add large code model information.

a902324

kuanlinchentw closed this Oct 2, 2023

kuanlinchentw reopened this Oct 2, 2023

tclin914 mentioned this pull request Oct 26, 2023

[RISCV] Support the large code model. llvm/llvm-project#70308

Merged

MaskRay reviewed Dec 21, 2023

kito-cheng reviewed Dec 22, 2023

kito-cheng reviewed Jan 31, 2024

Fix comment for large code model.

c9e7740

jrtc27 reviewed Feb 6, 2024

Fix the asm indent and 64-bit diretives.

af3126c

This was referenced Feb 15, 2024

riscv-elf.md: add new definitions for the compact code model #154

Closed

Define the large code model #254

Closed

sorear reviewed Feb 18, 2024

sorear mentioned this pull request Feb 20, 2024

Range extension thunks #425

Open

Avoid implying code and literal to be adjacent.

5396750

Add note for range extension thunk.

46fd42e

dlav-sc mentioned this pull request Jul 22, 2024

[lldb][RISCV] function calls support in lldb expressions llvm/llvm-project#99336

Merged

kito-cheng approved these changes Aug 9, 2024

kito-cheng merged commit 79fbbc8 into riscv-non-isa:master Aug 9, 2024
4 checks passed

asb mentioned this pull request Sep 10, 2024

Add __riscv_cmodel_large define for large code model riscv-non-isa/riscv-c-api-doc#86

Merged
