Multicore support for RISC-V #11418

nojb · 2022-07-08T16:00:12Z

This PR adds Multicore support to the RISC-V backend, closely following what was done for Arm64. The memory model is implemented by the following sequences:

| OCaml operation | RISC-V operation |
| --- | --- |
| Atomic load | `fence iorw, iorw; ld; fence iorw, iorw` |
| Atomic store | `fence iorw, ow; amoswap.d.aq` |
| Nonatomic load | `ld` |
| Nonatomic store | `sd` |

The first two sequences are what gcc emits for atomic_load and atomic_store. Perhaps this is too naïve; the experts will correct me :) For reference, the RISC-V memory model is called RVWMO (RISC-V Weak Memory Ordering), and an overview of it can be found in Chapter 17 of the official spec.

Beyond that, the PR consists of closely porting the Arm64 changes over to RISC-V. Some remarks (in no particular order):

- The amoswap.d.aq instruction requires trivial addressing in the memory operand; this is enforced in asmcomp/riscv/selection.ml.
- The backend does not use frame pointers, but it requires the stack to be 16-byte aligned (same as Arm64), so we define the Pop_frame_pointer macro in runtime/caml/stack.h in the same way as in Arm64, so that the same stack walking logic can go through.
- An extra register t0 is reserved for the runtime/code generator; perhaps it would have been possible to do some assembly gymnastics to keep t0 available for the register allocator, but it would have complicated matters and increased the diff between the RISC-V and Arm64 backends.
- When calling a "noalloc" external, the OCaml stack pointer needs to be saved in a callee-save register; s0 is used for this purpose (and so is marked as "destroyed" for noalloc calls, in Proc.destroyed_at_c_noalloc_call).
- Support for "short" allocation sequences (caml_allocN) was added.
- Some easy CFI directives were added following the Arm64 backend, but they are untested. In any case, full CFI support is left for later.

For testing, the virtual machine at INRIA's CI could be used, but I cannot access it anymore; perhaps someone who can could point it to this branch? For my part, I tested this on real hardware graciously made available by Tarides/Cambridge Computer Lab. The full testsuite passed on that hardware.

Assigning to @kayceesrk who expressed interest in looking at this PR, but as usual any and all feedback is warmly welcome.

smuenzel · 2022-07-09T00:38:41Z

asmcomp/riscv/emit.mlp

+        let lbl_call_gc = new_label () in
+        let n = -bytes in
+        let offset = Domainstate.(idx_of_field Domain_young_limit) * 8 in
+        if is_immediate n then


Should this use emit_addimm?

Yes, thanks. Fixed.

kayceesrk · 2022-07-12T03:19:16Z

Thanks for the PR @nojb. It will likely take me 2 weeks before I can start reviewing. Busy with teaching. Hope that's ok.

@ctk21 would also be a good person for this if he has some spare cycles.

nojb · 2022-07-12T06:18:46Z

> Thanks for the PR @nojb. It will likely take me 2 weeks before I can start reviewing. Busy with teaching. Hope that's ok.

Sure, no worries, there's no hurry :)

nojb · 2022-09-25T14:22:32Z

Friendly ping.

kayceesrk · 2022-09-28T04:31:26Z

Hi @nojb, this is next on my list.

kayceesrk · 2022-11-03T05:19:57Z

I've started reviewing the PR (finally). I have managed to build the branch on a riscv64/ubuntu Docker image. The testsuite is currently running. Given the comprehensive nature of the effect handler testsuite, if it passes then I am fairly confident that the implementation does the right thing for effect handlers.

I shall go over the memory model compilation first.

[Update]

The tests have now run to completion and the only failures do not seem to be connected to this PR but to my execution environment:

List of failed tests:
    tests/lib-unix/unix-socket/'recvfrom_linux.ml' with 1.1.2 (native) 
    tests/lib-unix/unix-socket/'recvfrom_unix.ml' with 1.1.2 (native) 
    tests/lib-unix/unix-socket/'recvfrom_unix.ml' with 1.1.1 (bytecode) 
    tests/lib-unix/unix-socket/'recvfrom_linux.ml' with 1.1.1 (bytecode)

kayceesrk · 2022-11-03T07:09:00Z

There is a discrepancy between the compilation of the memory model described in the original PR message and the implementation. The description in the PR message is not completely right; the implementation is closer to being correct, but is not fully there either.

Non-atomic loads ✅

The compilation of non-atomic loads is just a plain load, and this looks correct.

Non-atomic stores ❌

The description above says that non-atomic stores are compiled to plain stores (sd). This looks wrong when compared with the Arm64 compilation. In fact, the implementation follows the Arm64 backend more closely than the description does.

Firstly, we need to discriminate between the following:

1. Initialising writes
2. Non-initialising (non-atomic) writes of integers, where no publication of new objects can take place
3. Non-initialising (non-atomic) writes of pointers, with possible publication (caml_modify)
4. Writes to non-word-sized fields

For reference, this classification is similar to the one made for Arm64: #10972 (comment)

(1) ✅ should be compiled to plain stores (sd). This looks correct.

(2) ❌ needs to ensure that prior loads are ordered before the store. This is the reason for dmb ishld; str on Arm64. See https://github.com/ocaml/ocaml/blob/trunk/asmcomp/arm64/emit.mlp#L816-L817.

The current code sequence

fence iorw, ow
amoswap.d.aq x0, {emit_reg i.arg.(0)}, {emit_int ofs}({emit_reg i.arg.(1)})

at https://github.com/ocaml/ocaml/pull/11418/files#diff-c70b7c91c9069b1c55affc92395ef1abfe6f21442c81f845b65ac19a3141fae5R371 is too strong.

Table A.4 from the RISC-V spec provides the mapping from Arm to RISC-V (reproduced below).

According to this, the instruction sequence for (2) would be

fence r, rw
sd {emit_reg i.arg.(0)}, {emit_int ofs}({emit_reg i.arg.(1)})

In fact, dmb ishld is stronger than what we require. The memory model only needs to prevent prior loads from being reordered after the store. dmb ishld and fence r, rw additionally enforce load-load ordering. Hence, the sequence could be optimized as

fence r, w
sd {emit_reg i.arg.(0)}, {emit_int ofs}({emit_reg i.arg.(1)})

(3) ✅ is taken care of by caml_modify. No additional work is necessary here.

(4) ✅ These operations do not follow the memory model, so nothing needs to be done here.
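For illustration, the load-store ordering asked for in case (2) can be sketched with C11 atomics. This is only a sketch under stated assumptions: the helper name `store_int_field` is hypothetical and not part of the OCaml runtime, and pairing the fence with a relaxed store is an assumption modelled on the `fence; sd` sequences above. An acquire fence corresponds to the stronger `fence r, rw` variant; as far as I know, C11 has no fence that expresses only the load-store ordering of `fence r, w`.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not from the OCaml runtime): a non-initialising
   store of an integer into a mutable field. */
static inline void store_int_field(_Atomic intptr_t *field, intptr_t v)
{
  /* Order earlier loads before the store below. */
  atomic_thread_fence(memory_order_acquire);
  /* Word-sized store that requests no further ordering by itself. */
  atomic_store_explicit(field, v, memory_order_relaxed);
}
```

Compiling a function of this shape for riscv64 and inspecting the assembly is one way to compare what a C compiler emits against the hand-written sequence in emit.mlp.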

References

Explain mapping between OCaml memory model and C #10995

nojb · 2022-11-03T08:15:44Z

> Hence, the sequence could be optimized as

Thanks for the clear explanation @kayceesrk! I agree with it and have amended the PR accordingly. I guess loads also need an adjustment along similar lines?

kayceesrk · 2022-11-03T08:34:03Z

The atomic load also can be simplified a bit.

Atomic load

Currently, the atomic load is

fence iorw, iorw
ld
fence iorw, iorw

If you look at what clang generates for the C version of atomic load (for reference, see "Optimised mapping of operations to C" in #10995)

atomic_thread_fence(memory_order_acquire);
atomic_load_explicit(..., memory_order_seq_cst);

it is

(1) fence r, rw /* for the acquire fence */
(2) fence rw, rw
(3) ld
(4) fence r, rw

See https://godbolt.org/z/cEjf95WzM.

The fence rw, rw subsumes the preceding fence r, rw. Hence, the optimized instruction sequence will be

fence rw, rw
ld 
fence r, rw
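The subsumption argument can be modelled as a simple set-containment check. The sketch below is a toy model, not RVWMO itself: it treats a fence as a pair of predecessor and successor access sets, ignores the I/O bits and `fence.tso`, and all names in it (`struct fence`, `fence_subsumes`, `ACC_*`) are made up for illustration.

```c
#include <stdbool.h>

/* Access classes that may appear in a fence's predecessor/successor sets. */
enum { ACC_R = 1, ACC_W = 2, ACC_RW = ACC_R | ACC_W };

/* Simplified model of `fence pred, succ` (I/O bits are ignored). */
struct fence { unsigned pred, succ; };

/* a subsumes b when both of a's sets contain the corresponding sets of b,
   so a orders every pair of accesses that b orders. */
static bool fence_subsumes(struct fence a, struct fence b)
{
  return (a.pred & b.pred) == b.pred && (a.succ & b.succ) == b.succ;
}
```

In this model `fence rw, rw` subsumes `fence r, rw`, which is why the first fence in the clang output can be dropped, and `fence r, rw` subsumes `fence r, w`, which is why the latter is the weaker choice for non-atomic stores.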

Atomic store

Nothing to be done here as it goes through the C stub caml_atomic_exchange.

kayceesrk · 2022-11-03T08:48:29Z

asmcomp/riscv/emit.mlp

@@ -243,10 +253,9 @@ let emit_instr env i =
      assert (env.f.fun_prologue_required);
      let n = frame_size env in
      emit_stack_adjustment (-n);
-      if n > 0 then cfi_adjust_cfa_offset n;


Is there a reason for removing this line? It seems to do the right thing, even if CFI directives aren't fully right.

Indeed. The line has been moved inside emit_stack_adjustment (as it was missing from other places where this function was being used to adjust the stack pointer).

kayceesrk · 2022-11-03T08:55:55Z

asmcomp/riscv/emit.mlp

+  | Lop(Iload { memory_chunk = Word_int | Word_val; addressing_mode = Iindexed ofs; is_atomic } ) ->
+      if is_atomic then
+        `	fence rw, rw\n`;
+      `	ld	{emit_reg i.res.(0)}, {emit_int ofs}({emit_reg i.arg.(0)})\n`;


nit: alignment.

You mean to use a TAB instead of SPACE between fence and rw? Fixed!

I was pointing out that the backticks are not left-aligned.

Looking at the existing backend code, it seems that the backticks are aligned as normal OCaml code, so indented under an if. But I agree it looks a bit odd. Instead I put it on the same line as the if, similar to

ocaml/asmcomp/arm64/emit.mlp

Line 816 in 31ba6ed

if assignment then ` dmb ishld\n`;

Thanks. I was reading the code wrong 😑 . The fix looks good.

avsm · 2022-11-03T09:26:59Z

asmcomp/riscv/proc.ml

@@ -38,8 +38,7 @@ let word_addressed = false
    s2-s9        8-15      arguments/results (preserved by C)
    t2-t6        16-20     temporary
    s0           21        general purpose (preserved by C)


isn't s0 for the OCaml stack now and not fully general purpose?

Hi Anil, as far as I remember, we use s0 in only one spot as an auxiliary register to preserve the OCaml stack pointer (here: https://github.com/nojb/ocaml/blob/71b9ebae167c5574b25d18375df5c45715a0d787/asmcomp/riscv/emit.mlp#L326-L335) and to do that we mark s0 as being killed by noalloc calls (here: https://github.com/nojb/ocaml/blob/71b9ebae167c5574b25d18375df5c45715a0d787/asmcomp/riscv/proc.ml#L244-L249). But as far as the register allocator is concerned, I think that s0 is still treated as a general-purpose register.

Perhaps some trick could be used to avoid the need to reserve s0 in this way, but I couldn't figure it out so far...

Ah yes, quite right, thanks.

kayceesrk · 2022-11-03T10:14:18Z

asmcomp/riscv/proc.ml

  Array.of_list(List.map phys_reg
-    [0; 1; 2; 3; 4; 5; 6; 7; 8; 16; 17; 18; 19; 20; 22;
+    [0; 1; 2; 3; 4; 5; 6; 7; 16; 17; 18; 19; 20; 21 (* s0 *);


For posterity, could you tell us why 8 and 22 are gone and 21 added? A reply to this question here is fine. No need to add comments in the code.

8 (= s2) is removed: the comment in trunk indicates that this register was included in this list because it is clobbered by caml_c_call. However this seems like a mistake, because this list specifies those registers killed by noalloc calls (basically caller-save ones), which do not go through caml_c_call at all. So it looks like 8 did not need to be here at all to begin with.

21 (= s0) is added: this is explained in the discussion above (Multicore support for RISC-V #11418 (comment)).

22 (= t0) is removed: this one is now used in the runtime stubs as a general auxiliary register, so it is removed from the set of allocatable registers altogether (realized by lowering num_available_registers.(0) from 23 to 22).

kayceesrk

The PR looks good to me.

nojb · 2022-11-03T12:07:24Z

> The PR looks good to me.

Thanks! I am updating Changes, and am wondering if I should move it to the 5.1 section :) @Octachron?

avsm · 2022-11-03T16:42:46Z

> am wondering if I should move it to the 5.1 section :)

This is just fixing a bug in 5.0.0~beta1 isn't it? :-)

Octachron · 2022-11-03T16:57:17Z

Yes, a new port should be moved to the next version at this point of the release.
We could possibly discuss adding it to the possible midpoint 5.0.1 release.

nojb · 2022-11-03T17:02:38Z

> Yes, a new port should be moved to the next version at this point of the release. We could possibly discuss adding it to the possible midpoint 5.0.1 release.

OK! Moved to "Working version" section.

Planning to merge once CI passes. Thanks @kayceesrk for the review!

gasche · 2022-11-07T08:33:32Z

cpu_relax is defined per-architecture in a hidden corner of platform.h:

ocaml/runtime/caml/platform.h

Lines 34 to 47 in 762014f

    
    /* Hint for busy-waiting loops */
    Caml_inline void cpu_relax() {
    #ifdef __GNUC__
    #if defined(__x86_64__) || defined(__i386__)
      asm volatile("pause" ::: "memory");
    #elif defined(__aarch64__)
      asm volatile ("yield" ::: "memory");
    #else
      /* Just a compiler barrier */
      asm volatile ("" ::: "memory");
    #endif
    #endif
    }

No RISCV definition has been added. Is it because the fallback is adequate on RISCV, or because we forgot to extend this with a RISCV version?

(I wish it was easier to tell. Idle thoughts:

- Instead of a fallback, we could have a static error if no definition has been found for the architecture, at least in native mode.
- At least, if we keep the fallback, the comment should explicitly list which architectures have been considered for the fallback, so that we can tell after the fact if an omission is intentional or not. In particular, if the fallback is right for RISCV, we should add a comment to say so.)

gasche · 2022-11-07T08:40:34Z

Relevant: RISCV isa manual: 'Zihintpause' Pause Hint, version 2.0.

nojb · 2022-11-07T08:44:25Z

> No RISCV definition has been added. Is it because the fallback is adequate on RISCV, or because we forgot to extend this with a RISCV version?

We forgot to extend this with a RISC-V version.

> Relevant: RISCV isa manual: 'Zihintpause' Pause Hint, version 2.0

Indeed, the PAUSE instruction seems to be the corresponding thing in RISC-V land, see p. 39 https://github.com/riscv/riscv-isa-manual/releases/download/draft-20221004-28b46de/riscv-spec.pdf if you prefer PDF.

kayceesrk · 2022-11-07T08:57:03Z

Agree with extending cpu_relax with RISC-V PAUSE instruction.
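A minimal sketch of what such an extension could look like, mirroring the structure of the platform.h excerpt quoted earlier. The function name `cpu_relax_sketch` is made up, and emitting PAUSE as a raw word is an assumption made here so the code assembles even with toolchains that do not know the Zihintpause mnemonic; this thread does not say which form the follow-up change used.

```c
/* Hint for busy-waiting loops (illustrative sketch only). */
static inline void cpu_relax_sketch(void)
{
#ifdef __GNUC__
#if defined(__x86_64__) || defined(__i386__)
  __asm__ volatile ("pause" ::: "memory");
#elif defined(__aarch64__)
  __asm__ volatile ("yield" ::: "memory");
#elif defined(__riscv)
  /* PAUSE is encoded as a FENCE with pred=W, succ=0 (0x0100000F). */
  __asm__ volatile (".4byte 0x0100000f" ::: "memory");
#else
  /* Just a compiler barrier */
  __asm__ volatile ("" ::: "memory");
#endif
#endif
}
```

Since PAUSE lives in the FENCE encoding space, hardware without Zihintpause should execute it as an ordinary hint with no architectural effect, which is what makes the raw encoding a reasonable choice for a portable runtime.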

nojb requested a review from kayceesrk July 8, 2022 16:00

nojb force-pushed the riscv_5 branch from 8af44ad to ce13623 Compare July 8, 2022 18:20

smuenzel reviewed Jul 9, 2022

kayceesrk self-assigned this Jul 9, 2022

nojb mentioned this pull request Jul 20, 2022

Name mangling: use . instead of __ as module separator #11430

Merged

nojb force-pushed the riscv_5 branch from a4b022b to 0451052 Compare July 21, 2022 20:47

nojb added 2 commits July 28, 2022 09:21

Add Multicore support to RISC-V backend

78731e6

Changes

524315b

nojb force-pushed the riscv_5 branch from 0451052 to 524315b Compare July 28, 2022 09:21

nojb force-pushed the riscv_5 branch from a0401e7 to af805eb Compare November 3, 2022 08:08

Merge remote-tracking branch 'upstream/trunk' into riscv_5

e535abc

nojb force-pushed the riscv_5 branch from af805eb to c16d7d1 Compare November 3, 2022 08:12

nojb force-pushed the riscv_5 branch from db2e2c9 to 6967402 Compare November 3, 2022 08:41

nojb added 2 commits November 3, 2022 09:41

non-atomic stores: Use 'fence r,w; sd'

5888e09

Simplify atomic loads

f3dc417

nojb force-pushed the riscv_5 branch from 6967402 to f3dc417 Compare November 3, 2022 08:42

kayceesrk reviewed Nov 3, 2022

nojb added 2 commits November 3, 2022 09:58

Use tab

9fb280f

Style

71b9eba

avsm reviewed Nov 3, 2022

kayceesrk reviewed Nov 3, 2022

kayceesrk approved these changes Nov 3, 2022

Changes

a9162e4

Changes

35bfa84

nojb merged commit 9d5c55c into ocaml:trunk Nov 3, 2022

nojb deleted the riscv_5 branch November 3, 2022 17:51

nojb mentioned this pull request Nov 7, 2022

Add definition of cpu_relax for RISC-V #11708

Merged

kayceesrk mentioned this pull request Dec 23, 2022

[Trial PR] Add ThreadSanitizer support ocaml-multicore/ocaml-tsan#22

Closed

hack3ric pushed a commit to hack3ric/ocaml that referenced this pull request Jun 3, 2023

Multicore support for RISC-V (ocaml#11418)

52611d2

tmcgilchrist mentioned this pull request Jun 7, 2023

Update the POWER port for OCaml 5 #12276

Merged

jmid mentioned this pull request Jul 11, 2023

Correct the bytecode architectures for OCaml 5.1 and OCaml 5.2 (trunk) ocaml/opam-repository#24075

Merged

Jingwiw mentioned this pull request Feb 1, 2024

Backport RISC-V multicore support for OCaml openeuler-riscv/oerv-team#125

Open

kayceesrk added the memory-model label Jun 20, 2024
