Multicore support for RISC-V #11418

Merged
merged 9 commits into from
Nov 3, 2022

Contributor

@nojb nojb commented Jul 8, 2022

This PR adds Multicore support to the RISC-V backend, closely following what was done for Arm64. The memory model is implemented by the following sequences:

| OCaml operation | RISC-V operation |
|---|---|
| Atomic load | fence iorw, iorw; ld; fence iorw, iorw |
| Atomic store | fence iorw, ow; amoswap.d.aq |
| Nonatomic load | ld |
| Nonatomic store | sd |

The first two sequences are what gcc emits for atomic_load and atomic_store. Perhaps this is too naïve; the experts will correct me :) For reference: the RISC-V memory model is called RVWMO (RISC-V Weak Memory Ordering), and an overview of it can be found in the official spec, Chapter 17.
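
As a point of reference for the first two rows, here is a minimal C11 sketch of the source-level operations meant by atomic_load and atomic_store. The function names load_seq_cst and store_seq_cst are made up for illustration. Compiling something like this for RISC-V and inspecting the assembly is one way to compare against the table, though the exact fences emitted may differ between compilers and versions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: sequentially consistent load of a 64-bit word.
   atomic_load defaults to memory_order_seq_cst. */
int64_t load_seq_cst(_Atomic int64_t *p) {
  return atomic_load(p);
}

/* Hypothetical helper: sequentially consistent store of a 64-bit word.
   atomic_store defaults to memory_order_seq_cst. */
void store_seq_cst(_Atomic int64_t *p, int64_t v) {
  atomic_store(p, v);
}
```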

Beyond that, the PR consists in closely porting over the Arm64 changes to RISC-V. Some remarks (in no particular order):

  • The amoswap.d.aq instruction requires trivial addressing in the memory operand; this is enforced in asmcomp/riscv/selection.ml.
  • The backend does not use frame pointers, but it requires the stack to be 16-byte aligned (same as Arm64), so we define the Pop_frame_pointer macro in runtime/caml/stack.h in the same way as for Arm64, so that the same stack-walking logic can be used.
  • An extra register t0 is reserved for the runtime/code generator; it might have been possible to do some assembly gymnastics to keep t0 available for the register allocator, but that would have complicated matters and increased the diff between the RISC-V and Arm64 backends.
  • When calling a "noalloc" external, the OCaml stack pointer needs to be saved in a callee-save register; s0 is used for this purpose (and so is marked as "destroyed" for noalloc calls, in Proc.destroyed_at_c_noalloc_call).
  • Support for "short" allocation sequences (caml_allocN) was added.
  • Some easy CFI directives were added following the Arm64 backend, but they are untested. In any case full CFI support is left for later.
  • For testing, the virtual machine at INRIA's CI could be used, but I cannot access it anymore; perhaps someone who can could point it to this branch? For my part, I tested this on real hardware graciously made available by Tarides/Cambridge Computer Lab. The full testsuite passed on that hardware.

Assigning to @kayceesrk who expressed interest in looking at this PR, but as usual any and all feedback is warmly welcome.

let lbl_call_gc = new_label () in
let n = -bytes in
let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
if is_immediate n then

Contributor

Should this use emit_addimm?

Contributor Author

Yes, thanks. Fixed.

@kayceesrk kayceesrk self-assigned this Jul 9, 2022
@kayceesrk
Contributor

Thanks for the PR @nojb. It will likely take me 2 weeks before I can start reviewing. Busy with teaching. Hope that's ok.

@ctk21 would also be a good person for this if he has some spare cycles.

@nojb
Contributor Author

nojb commented Jul 12, 2022

> Thanks for the PR @nojb. It will likely take me 2 weeks before I can start reviewing. Busy with teaching. Hope that's ok.

Sure, no worries, there's no hurry :)

@nojb
Contributor Author

nojb commented Sep 25, 2022

Friendly ping.

@kayceesrk
Contributor

Hi @nojb, this is next on my list.

@kayceesrk
Contributor

kayceesrk commented Nov 3, 2022

I've started reviewing the PR (finally). I have managed to build the branch on a riscv64/ubuntu Docker image. The testsuite is currently running. Given the comprehensive nature of the effect handler testsuite, if it passes then I am fairly confident that the implementation does the right thing for effect handlers.

I shall go over the memory model compilation first.

[Update]

The tests have now run to completion and the only failures do not seem to be connected to this PR but to my execution environment:

List of failed tests:
    tests/lib-unix/unix-socket/'recvfrom_linux.ml' with 1.1.2 (native) 
    tests/lib-unix/unix-socket/'recvfrom_unix.ml' with 1.1.2 (native) 
    tests/lib-unix/unix-socket/'recvfrom_unix.ml' with 1.1.1 (bytecode) 
    tests/lib-unix/unix-socket/'recvfrom_linux.ml' with 1.1.1 (bytecode) 

@kayceesrk
Contributor

There is a discrepancy between the memory model compilation described in the original PR message and the implementation. The description in the PR message is not completely right; the implementation is closer to correct, but not fully.

Non-atomic loads ✅

The compilation of non-atomic loads is just a plain load, and this looks correct.

Non-atomic stores ❌

The description above says that non-atomic stores are compiled to plain stores (sd). This looks wrong if you compare it to the Arm64 compilation. In fact, the implementation more closely follows the Arm64 backend.

Firstly, we need to discriminate between the following:

  1. Initialising writes
  2. non-initialising (non-atomic) writes of integers, where no publication of new objects can take place
  3. non-initialising (non-atomic) writes of pointers, with possible publication (caml_modify)
  4. writes to non-word-sized fields

For reference, this classification is similar to the one made for Arm64: #10972 (comment)

(1) ✅ should be compiled to plain stores (sd). This looks correct.

(2) ❌ needs to ensure that prior loads are ordered before the store. This is the reason for dmb ishld; str on Arm64. See https://github.com/ocaml/ocaml/blob/trunk/asmcomp/arm64/emit.mlp#L816-L817.

The current code sequence

fence iorw, ow
amoswap.d.aq x0, {emit_reg i.arg.(0)}, {emit_int ofs}({emit_reg i.arg.(1)})

at https://github.com/ocaml/ocaml/pull/11418/files#diff-c70b7c91c9069b1c55affc92395ef1abfe6f21442c81f845b65ac19a3141fae5R371 is too strong.

Table A.4 from the RISC-V spec provides the mapping from Arm to RISC-V (reproduced below).

[Image: Table A.4 of the RISC-V spec, mapping from Arm to RISC-V]

According to this, the instruction sequence for (2) would be

fence r, rw
sd {emit_reg i.arg.(0)}, {emit_int ofs}({emit_reg i.arg.(1)})

In fact, dmb ishld is stronger than what we require. The memory model only needs to prevent prior loads from being reordered after the store. dmb ishld and fence r, rw additionally enforce load-load ordering. Hence, the sequence could be optimized as

fence r, w
sd {emit_reg i.arg.(0)}, {emit_int ofs}({emit_reg i.arg.(1)})

(3) ✅ is taken care of by caml_modify. No additional work is necessary here.

(4) ✅ these operations do not follow the memory model, so nothing needs to be done here.
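
To illustrate case (2) at the C level: a rough C11 analogue of "fence r, rw; sd" is an acquire fence followed by a plain (relaxed) store. This is only a sketch. store_int_field is a hypothetical helper, not a function from the OCaml runtime, and C11 has no fence that corresponds to the weaker fence r, w, so the acquire fence stands in for it here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper for case (2): a non-initialising store of an
   integer into a word-sized field. */
void store_int_field(_Atomic intptr_t *field, intptr_t v) {
  /* Order all prior loads before the store below (and before later
     loads, which is more than the memory model strictly needs). */
  atomic_thread_fence(memory_order_acquire);
  /* The store itself carries no ordering of its own. */
  atomic_store_explicit(field, v, memory_order_relaxed);
}
```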

@nojb
Contributor Author

nojb commented Nov 3, 2022

> Hence, the sequence could be optimized as

Thanks for the clear explanation @kayceesrk! I agree with it and have amended the code accordingly. I guess loads also need an adjustment along similar lines?

@kayceesrk
Contributor

kayceesrk commented Nov 3, 2022

The atomic load also can be simplified a bit.

Atomic load

Currently, the atomic load is

fence iorw, iorw
ld
fence iorw, iorw

If you look at what clang generates for the C version of atomic load (for reference, see "Optimised mapping of operations to C" in #10995)

atomic_thread_fence(memory_order_acquire);
atomic_load_explicit(..., memory_order_seq_cst);

it is

(1) fence r, rw /* for the acquire fence */
(2) fence rw, rw
(3) ld
(4) fence r, rw

See https://godbolt.org/z/cEjf95WzM.

The fence rw, rw subsumes the preceding fence r, rw. Hence, the optimized instruction sequence will be

fence rw, rw
ld 
fence r, rw
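
For completeness, the C version quoted above can be wrapped in a small function. The two statements come from the comment; the wrapper name atomic_load_field and the use of intptr_t as the field type are assumptions for illustration. Compiling this with clang for riscv64 should give something close to the four-instruction sequence listed above, although the output may vary with compiler version.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical wrapper around the two C statements quoted above. */
intptr_t atomic_load_field(_Atomic intptr_t *field) {
  /* Acquire fence: orders prior loads before everything that follows. */
  atomic_thread_fence(memory_order_acquire);
  /* Sequentially consistent load of the field. */
  return atomic_load_explicit(field, memory_order_seq_cst);
}
```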

Atomic store

Nothing to be done here as it goes through the C stub caml_atomic_exchange.

@@ -243,10 +253,9 @@ let emit_instr env i =
assert (env.f.fun_prologue_required);
let n = frame_size env in
emit_stack_adjustment (-n);
if n > 0 then cfi_adjust_cfa_offset n;

Contributor

Is there a reason for removing this line? It seems to do the right thing, even if CFI directives aren't fully right.

Contributor Author

Indeed. The line has been moved inside emit_stack_adjustment (as it was missing from other places where this function was being used to adjust the stack pointer).

| Lop(Iload { memory_chunk = Word_int | Word_val; addressing_mode = Iindexed ofs; is_atomic } ) ->
if is_atomic then
` fence rw, rw\n`;
` ld {emit_reg i.res.(0)}, {emit_int ofs}({emit_reg i.arg.(0)})\n`;

Contributor

nit: alignment.

Contributor Author

You mean to use a TAB instead of SPACE between fence and rw? Fixed!

Contributor

I was pointing out that the backticks are not left-aligned.

Contributor Author

Looking at the existing backend code, it seems that the backticks are aligned as normal OCaml code, so indented under an if. But I agree it looks a bit odd. Instead I put it on the same line as the if, similar to

if assignment then ` dmb ishld\n`;

Contributor

Thanks. I was reading the code wrong 😑 . The fix looks good.

@@ -38,8 +38,7 @@ let word_addressed = false
s2-s9 8-15 arguments/results (preserved by C)
t2-t6 16-20 temporary
s0 21 general purpose (preserved by C)

Member

isn't s0 for the OCaml stack now and not fully general purpose?

Contributor Author

Hi Anil, as far as I remember, we use s0 in only one spot as an auxiliary register to preserve the OCaml stack pointer (here: https://github.com/nojb/ocaml/blob/71b9ebae167c5574b25d18375df5c45715a0d787/asmcomp/riscv/emit.mlp#L326-L335) and to do that we mark s0 as being killed by noalloc calls (here: https://github.com/nojb/ocaml/blob/71b9ebae167c5574b25d18375df5c45715a0d787/asmcomp/riscv/proc.ml#L244-L249). But as far as the register allocator is concerned, I think that s0 is still treated as general purpose.

Perhaps some trick could be used to avoid the need to reserve s0 in this way, but I couldn't figure it out so far...

Member

Ah yes, quite right, thanks.

Array.of_list(List.map phys_reg
- [0; 1; 2; 3; 4; 5; 6; 7; 8; 16; 17; 18; 19; 20; 22;
+ [0; 1; 2; 3; 4; 5; 6; 7; 16; 17; 18; 19; 20; 21 (* s0 *);

Contributor

For posterity, could you tell us why 8 and 22 are gone and 21 added? A reply to this question here is fine. No need to add comments in the code.

Contributor Author

@nojb nojb Nov 3, 2022

  • 8 (= s2) is removed: the comment in trunk indicates that this register was included in this list because it is clobbered by caml_c_call. However this seems like a mistake, because this list specifies those registers killed by noalloc calls (basically caller-save ones), which do not go through caml_c_call at all. So it looks like 8 did not need to be here at all to begin with.
  • 21 (= s0) is added: this is explained in the discussion above (Multicore support for RISC-V #11418 (comment)).
  • 22 (= t0) is removed: this one is now used in the runtime stubs as a general auxiliary register, so it is removed from the set of allocatable registers altogether (realized by lowering num_available_registers.(0) from 23 to 22).

Contributor

@kayceesrk kayceesrk left a comment

The PR looks good to me.

@nojb
Contributor Author

nojb commented Nov 3, 2022

> The PR looks good to me.

Thanks! I am updating Changes, and am wondering if I should move it to the 5.1 section :) @Octachron?

@avsm
Member

avsm commented Nov 3, 2022

> am wondering if I should move it to the 5.1 section :)

This is just fixing a bug in 5.0.0~beta1 isn't it? :-)

@Octachron
Member

Yes, a new port should be moved to the next version at this point of the release.
We could possibly discuss adding it to the possible midpoint 5.0.1 release.

@nojb
Contributor Author

nojb commented Nov 3, 2022

> Yes, a new port should be moved to the next version at this point of the release. We could possibly discuss adding it to the possible midpoint 5.0.1 release.

OK! Moved to "Working version" section.

Planning to merge once CI passes. Thanks @kayceesrk for the review!

@nojb nojb merged commit 9d5c55c into ocaml:trunk Nov 3, 2022
@nojb nojb deleted the riscv_5 branch November 3, 2022 17:51
@gasche
Member

gasche commented Nov 7, 2022

cpu_relax is defined per-architecture in a hidden corner of platform.h:

/* Hint for busy-waiting loops */
Caml_inline void cpu_relax() {
#ifdef __GNUC__
#if defined(__x86_64__) || defined(__i386__)
  asm volatile("pause" ::: "memory");
#elif defined(__aarch64__)
  asm volatile ("yield" ::: "memory");
#else
  /* Just a compiler barrier */
  asm volatile ("" ::: "memory");
#endif
#endif
}

No RISCV definition has been added. Is it because the fallback is adequate on RISCV, or because we forgot to extend this with a RISCV version?

(I wish it was easier to tell. Idle thoughts:

  • Instead of a fallback, we could have a static error if no definition has been found for the architecture, at least in native mode
  • At least, if we keep the fallback, the comment should explicitly list which architectures have been considered for the fallback, so that we can tell after the fact if an omission is intentional or not. In particular, if the fallback is right for RISCV, we should add a comment to say so.)

@gasche
Member

gasche commented Nov 7, 2022

Relevant: RISCV isa manual: 'Zihintpause' Pause Hint, version 2.0

@nojb
Contributor Author

nojb commented Nov 7, 2022

> No RISCV definition has been added. Is it because the fallback is adequate on RISCV, or because we forgot to extend this with a RISCV version?

We forgot to extend this with a RISC-V version.

> Relevant: RISCV isa manual: 'Zihintpause' Pause Hint, version 2.0

Indeed, the PAUSE instruction seems to be the corresponding thing in RISC-V land, see p. 39 https://github.com/riscv/riscv-isa-manual/releases/download/draft-20221004-28b46de/riscv-spec.pdf if you prefer PDF.

@kayceesrk
Contributor

Agree with extending cpu_relax with the RISC-V PAUSE instruction.
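
A possible shape for that extension is sketched below. It is an illustration rather than the change that was eventually committed: the function is renamed cpu_relax_sketch, plain static inline replaces Caml_inline, and the RISC-V branch emits the raw encoding of PAUSE (a FENCE with pred=W and succ=0, i.e. 0x0100000F) so that assemblers without Zihintpause support still accept it. The __riscv macro test and the choice of the raw encoding are assumptions. spin_until_set is a hypothetical caller showing the intended use in a busy-wait loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hint for busy-waiting loops (sketch with an added RISC-V branch). */
static inline void cpu_relax_sketch(void) {
#ifdef __GNUC__
#if defined(__x86_64__) || defined(__i386__)
  __asm__ volatile ("pause" ::: "memory");
#elif defined(__aarch64__)
  __asm__ volatile ("yield" ::: "memory");
#elif defined(__riscv)
  /* Assumption: raw encoding of the Zihintpause PAUSE hint. */
  __asm__ volatile (".4byte 0x0100000F" ::: "memory");
#else
  /* Just a compiler barrier */
  __asm__ volatile ("" ::: "memory");
#endif
#endif
}

/* Hypothetical caller: spin until another thread sets the flag. */
static inline void spin_until_set(atomic_bool *flag) {
  while (!atomic_load_explicit(flag, memory_order_acquire))
    cpu_relax_sketch();
}
```

Because PAUSE lives in the FENCE encoding space as a hint, a core that does not implement Zihintpause should still execute it without trapping, which is what makes the raw encoding a reasonable default here.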
