feat(esp32c3): implement mem{set,cpy} in SRAM

Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```
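For reference, a minimal sketch of the source shape behind the two listings. `collect_prios` and its `pending` argument are hypothetical stand-ins, not the actual interrupt handler, and whether `rustc` really emits `memset`/`memcpy` calls here depends on the target and optimization level:

```rust
/// Hypothetical stand-in for the interrupt-path code shown above.
#[inline(never)]
pub fn collect_prios(pending: u16) -> [u128; 16] {
    // Zero-initializing 256 bytes: typically lowered to a `memset` call.
    let mut prios = [0u128; 16];
    for (i, slot) in prios.iter_mut().enumerate() {
        if pending & (1 << i) != 0 {
            *slot = 1u128 << i;
        }
    }
    // Handing the 256-byte array back by value may be lowered to a `memcpy`
    // into the caller's stack slot.
    prios
}
```

Neither call appears in the source, which is why it is easy to miss that both land in flash-mapped text unless the symbols are overridden.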

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) placed in SRAM: `rustc` is quite happy to emit lots of
implicit calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. The
expectation seems to be that the compiler-builtins versions are
"reasonable defaults," and they are [weakly linked] specifically to
allow domain-specific overrides like this one.

On the 'c3, the compiler-builtins versions end up fairly large, adding
frequent cache pressure for minimal wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.
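If unaligned pointers ever did show up, one possible direction (not part of this commit) is to route the word accesses through `core::ptr::read_unaligned` / `write_unaligned`. The sketch below uses a hypothetical name, `copy_words_unaligned`, and only removes the alignment assumption at the Rust level; it makes no claim about what the 'c3 instruction bus tolerates, since the compiler is free to lower these into whatever loads and stores the target allows:

```rust
use core::ptr;

/// Hypothetical alignment-tolerant variant of the word copy.
///
/// # Safety
/// `src` must be readable and `dest` writable for `n` bytes, and the two
/// regions must not overlap.
pub unsafe fn copy_words_unaligned(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let (words, tail) = (n / 4, n % 4);
    let d = dest.cast::<u32>();
    let s = src.cast::<u32>();
    // Word-sized body: no alignment is assumed for either pointer.
    for i in 0..words {
        unsafe { ptr::write_unaligned(d.add(i), ptr::read_unaligned(s.add(i))) };
    }
    // Remaining 0..=3 bytes are copied one at a time.
    let done = words * 4;
    for i in 0..tail {
        unsafe { *dest.add(done + i) = *src.add(done + i) };
    }
    dest
}
```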

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline,
which seems to predict all branches as "never taken."
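As an illustration of the kind of "small additional effort" meant here, a sketch of a manually unrolled word fill that issues four stores per loop-closing branch. `fill_words_unrolled` is a hypothetical helper, not something in this commit, and it is unmeasured: the compiler may already unroll (or re-roll) such loops on its own.

```rust
/// Hypothetical 4x-unrolled word fill.
///
/// # Safety
/// `p` must be 4-byte aligned and writable for `n` words.
pub unsafe fn fill_words_unrolled(p: *mut u32, w: u32, n: usize) {
    let mut i = 0;
    // Main body: four stores for every backward branch taken.
    while i + 4 <= n {
        unsafe {
            *p.add(i) = w;
            *p.add(i + 1) = w;
            *p.add(i + 2) = w;
            *p.add(i + 3) = w;
        }
        i += 4;
    }
    // Leftover 0..=3 words.
    while i < n {
        unsafe { *p.add(i) = w };
        i += 1;
    }
}
```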

However, there is also a real danger in overtraining on the
microbenchmarks here, as I would expect almost no one has a program
whose runtime is dominated by these functions. Making these functions
larger and more complex to eke out wins from architectural niches makes
LLVM much less willing to inline them, which costs additional function
calls and prevents optimizations such as dead code elimination for
always-aligned addresses or automatic loop unrolling.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
sethp committed May 14, 2023
1 parent 2d957e4 commit dd137fd
Showing 1 changed file with 40 additions and 0 deletions.
40 changes: 40 additions & 0 deletions esp32c3-hal/src/lib.rs
@@ -16,6 +16,46 @@ pub mod analog {
pub use esp_hal_common::analog::{AvailableAnalog, SarAdcExt};
}

mod mem {
    #[no_mangle]
    #[link_section = ".rwtext"]
    pub unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
        let r = dest;
        let (n, m) = (n / 4, n % 4);
        for i in 0..m {
            *dest.add(i) = *src.add(i);
        }
        let dest = dest.add(m).cast::<usize>();
        let src = src.add(m).cast::<usize>();
        for i in 0..n {
            *dest.add(i) = *src.add(i);
        }
        r
    }

    #[no_mangle]
    #[link_section = ".rwtext"]
    pub unsafe extern "C" fn memset(
        p: *mut u8,
        c: i32, // equivalent to a c int
        n: usize,
    ) -> *mut u8 {
        let s = p;
        let (n, m) = (n / 4, n % 4);
        let b = c as u8;
        for i in 0..m {
            *p.add(i) = b
        }
        let p = p.add(m).cast::<usize>();

        let w = usize::from_ne_bytes([b; 4]);
        for i in 0..n {
            *p.add(i) = w;
        }
        s
    }
}

extern "C" {
cfg_if::cfg_if! {
if #[cfg(feature = "mcu-boot")] {