
feat(riscv+esp32c3): demonstrate gpio_interrupt example entirely in SRAM #534

Closed
sethp wants to merge 9 commits

Conversation

@sethp (Contributor) commented May 14, 2023

As mentioned over on #448, the current RISC-V HAL interrupt flow isn't memory-resident, with large portions of it living in Flash.

This change set is not intended to be merged as-is, or probably at all: it's more of a "spike" or "steel thread" demonstrating the areas of work I've been able to identify in order to meet the goal of "[an] interrupt request is serviced without touching flash memory." My friend & I needed to do this because we have a hard-real-time interrupt that fires at ~480Hz (60Hz * 8), which made for some 🌶️ contention at the cache layer.

As usual with these things, a lot of the wins can be had for relatively little effort. I've tried to structure these changes in increasing order from least implementation/maintenance burden to most, with each commit message giving a detailed explanation of what problem I believed I was solving at each step. Happily, they're also largely in sync with the size of the wins (with the exception of the mem* friends, discussed later).

What I'm hoping for from this discussion is mostly just to share my findings, and to identify where we might already have consensus on what the next steps ought to be (if any). In an effort to apply Cunningham's Law, my current beliefs about the disposition and direction of these changes are below.

ready for (early) review

  • fa671bb is something I intend to replicate in all the other RISC-V linker scripts, at which point that change will be ready to be merged into the mainline HAL. feat(riscv): relocate .trap machinery to RAM #541
  • cda36dd ought to be carried through to complete the conversion from const slices to inline-able functions (sketched just after this list). I don't yet have any 'c6s to try out the PLIC side of things, though I did order some for that purpose.
  • e29f3d5 (which is really rustbox/esp-pacs@c8e7cbb) is worth pursuing, which would involve:
    • making a change to the esp-pacs generator relocating that object to .trap.rodata (my preference), or .data
    • (optionally) if using .trap.rodata, making a change at this point to the linker scripts that live here in esp-hal to emit objects from that section
    • the release train leaves the station, I guess? It's not a breaking change if we can ensure the newer pacs are only ever used with the newer linker scripts, so there's maybe a phrasing of cargo constraints here that allows this to be a patch release. Or, we use .data and it's definitely not breaking.
  • 115a334 is a toss-up for me; it's written, I got to feel like a C-x M-c M-butterfly programmer, and it works as far as I've tested it. But, the LLVM lookup table that it replaces is very small, about 0x40 bytes.1 It meets a very strict "no flash, ever" target, which I think applies more to the RTC memory stuff (that I have only surface-level knowledge of) than to interrupt handling.
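
For the cda36dd item, a minimal sketch of the const-slice → inline-able-function conversion (the function name and signature are my own stand-ins, not the HAL's actual API):

```rust
// Hypothetical replacement for the `INTERRUPT_TO_PRIORITY` const slice: an
// always-inlined function folds away at the call site, so no `.rodata`
// lookup table ever materializes in flash.
#[inline(always)]
fn interrupt_to_priority(idx: usize) -> usize {
    idx // the identity mapping on the esp32c3; other chips may differ
}

// Call site, mirroring the original indexing:
//     status & configured_interrupts[interrupt_to_priority(cpu_intr as usize - 1)]
```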

here be dragons

  • 34fc2fb, 54fc729, and d43df90 are not worth merging directly, not least because I don't think they actually have an effect on any consumers of the HAL outside the esp32c3/src/examples directory. They do demonstrate just how tricky it is to 1) predict and 2) validate whether a particular call graph is memory-resident or not. Initially, I believed inspecting the Rust functions and throwing a few #[inline(always)]s around would be enough to ensure that property under a wide range of conditions, but what I found was... not that.
    • If there is an action here, I think it would be to consider updating the template project with some of the settings.
  • b511568 is a tough one: as you might expect, the care and feeding of a memcpy implementation is significantly more difficult than it first appears. Regrettably, it's just so frickin' good: memcpy (and to a lesser degree, memset, memmove, and memcmp) show up a lot: even at the highest optimization levels, there are 9 memcpy call sites in the gpio_interrupt example alone. In my downstream project, we've got ~50 memcpy calls that span the gamut from "one time, at startup" to "hottest of the hot path." Implementing that family of functions ourselves was a tremendous performance win across the board, and is well worth the high price of maintenance, at least for us.
    • Probably the thing to do is defer, at least for now: the esp-hal side of the IRQ flow only uses memset/memcpy when dealing with the [u128; 16] array that represents all configured interrupts, so it seems plausible that finding a different way to represent that data structure, or fetching the priorities a few at a time instead of all at once, might dodge the use of those functions entirely, which would allow us to meet the "nothing in flash" requirement without taking ownership of them (see the sketch after this list).
    • I'm a little bit at a loss as to where or whom we should defer to, though: is this something that the rust-embedded folks would be interested in? It seems like a lot of microcontrollers are going to be in the same boat as the esp32c3 here, no?
  • after all that, 8efc0b9 is both anti-climactic and a little bit distressing: on the plus side, it's short! On the minus side, that's because the example is trivial, and what little of it there is points towards something I've been avoiding mentioning thus far: I have no idea what to do about core. The first and most immediate place that arises is, as here, in string formatting. However, it's not actually possible to work around in that way everywhere, because formatting an error message is a huge part of calls to panic!, most of which get emitted automatically for e.g. bounds checks on arrays, unwrap calls, arithmetic overflow (when enabled), etc. Mostly this is 🔥 fine 🔥 because panic!s are generally non-recoverable, so taking extra time to load a string from flash right before the CPU goes into a sleep-until-reset loop isn't really a big deal. That said, it feels to me like a potentially large area of future work, with an uncertain scope and level of risk. I'm very curious what y'all's thoughts are on both how important it is to get the panic handling bits fully in RAM and whether there are any strategies that come to mind for how to do it.
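
And for the memset/memcpy deferral idea in the b511568 bullet, a hedged sketch of the "fetch the priorities a few at a time" shape (the names and layout are assumptions, not the HAL's actual representation):

```rust
/// Hypothetical per-priority masks of configured interrupts, kept in RAM.
static CONFIGURED_INTERRUPTS: [u128; 16] = [0; 16];

/// Scan priorities highest-first, touching one `u128` per step; since no
/// 256-byte local array is ever materialized, rustc has no reason to emit
/// the implicit `memset`/`memcpy` calls that the current code does.
fn highest_pending_priority(status: u128) -> Option<usize> {
    (1usize..16).rev().find(|&prio| status & CONFIGURED_INTERRUPTS[prio] != 0)
}
```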

caveats

  • the above applies only to the interrupt trap flow; I haven't yet examined the exception handling flow, nor the atomics emulation that's part of that. I haven't measured yet, but that seems like a worthwhile unit of future work too.
  • and, finally, I don't want to forget about moving EspDefaultHandler, ExceptionHandler, and DefaultInterruptHandler; those'd be separate PRs against esp-riscv-rt and esp-backtrace (probably), but I haven't put much thought yet into helping out downstream crates that might provide their own implementations.

Phew, that was a whole lot! Let me know if there's a time y'all want to hop on a video call or if there's a semi-synchronous chat I should join to go through this together! I don't expect this to be our last opportunity to connect, though, so no pressure from my end either way.

cc @MabezDev

Footnotes

  1. I can't for the life of me identify the object in the ELF, but from looking at the hex dump of .rodata it's tucked in between two fairly obvious strings that at least provide an upper bound on the size.

sethp added 9 commits May 10, 2023 17:24

Previously, the trap vector itself and its immediate callees
(`_start_trap` and `_start_trap_rust_hal`) were located in the mapped
instruction flash range `0x420..`, increasing cache pressure and adding
variable latency to the very beginning of the interrupt/exception
service flow.

This change places those routines in iram directly:

```
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 48177: 40380280  2428 FUNC    GLOBAL DEFAULT    6 _start_trap_rust_hal
 48197: 40380bfc    54 FUNC    GLOBAL DEFAULT    6 _start_trap_rust
 48265: 40380200     0 FUNC    GLOBAL DEFAULT    6 _vector_table
 48349: 40380100     0 NOTYPE  GLOBAL DEFAULT    6 default_start_trap
 48350: 40380100     0 NOTYPE  GLOBAL DEFAULT    6 _start_trap
```

As seen via `readelf -W -s -C ./target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt | grep -E _start_trap\|_vector\|Ndx`.

Previously, the `INTERRUPT_TO_PRIORITY` constant would compile down to a
runtime lookup table called `anon.HASH.2.llvm.NUMBERS` like so:

```asm
            status & configured_interrupts[INTERRUPT_TO_PRIORITY[cpu_intr as usize - 1]];
...
40380fbc:       3c0095b7                lui     a1,0x3c009
40380fc0:       2d858593                addi    a1,a1,728 # 3c0092d8 <anon.73967c570b981c3046610ed0622f71c3.2.llvm.13737670072012269642>
40380fc4:       952e                    add     a0,a0,a1
40380fc6:       4108                    lw      a0,0(a0)
```

Which means that in order to load the indexed slot from the LUT, we have
to go out to flash to do it.

As an alternative to an `#[inline]`'d function (which is nice in the
'c3's case because the mapping is the identity function, so it disappears
at runtime), `INTERRUPT_TO_PRIORITY` could be made a `static` instead of
a `const`, which would allow us to place it into the `.data` section via
a linker directive.
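
For concreteness, a sketch of that `static` alternative (the linker directive is just `#[link_section]`; the table contents here are illustrative, not the real mapping):

```rust
// Pinning the table into `.data` means it is copied into DRAM at startup,
// so the indexed load above never touches the flash-mapped `.rodata` range.
#[link_section = ".data"]
static INTERRUPT_TO_PRIORITY: [usize; 15] = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
];
```

The trade-off is that the table then permanently occupies DRAM, which is why the disappearing `#[inline]` function is the nicer option on the 'c3.
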
Previously, the code for `intrinsics::cttz` (e.g. `u128::trailing_zeros()`)
was being inlined into the call site, but that code was referencing a
lookup table from the `.rodata` section (placed in flash):

```
40380xxx:       3c0066b7                lui     a3,0x3c006
40380xxx:       ed168293                addi    t0,a3,-303 # 3c005ed1 <str.1+0x81>
40380xxx:       03810813                addi    a6,sp,56
...
40380xxx:       41e00633                neg     a2,t5
40380xxx:       00cf7633                and     a2,t5,a2
40380xxx:       03160633                mul     a2,a2,a7
40380xxx:       826d                    srli    a2,a2,0x1b
40380xxx:       9616                    add     a2,a2,t0
40380xxx:       00064603                lbu     a2,0(a2)
```

Caution: the `str.1` symbol name there is a red herring: `str.1` has a
length of just 0x21 bytes; there's seemingly no symbolic name associated
with the lookup table, and `str.1` just happens to be the closest
preceding label.

I can't claim credit for the `ctz32` algorithm here: all I've done is
copy it from Wikipedia, translate the operations to their Rust
equivalents, and then apply it four times to extend it to a `u128`. I
have tested it both for `0b0..010..0` (single bit set) and `0xff...100`
(all non-trailing bits set) at every bit position, which did catch at
least two mistakes in the translation.
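
For reference, the shape of that translation (a sketch; the committed code may differ in its details):

```rust
/// Branch-based count-trailing-zeros, per the classic binary-search
/// formulation: no lookup table, so nothing to fault in from flash.
fn ctz32(mut v: u32) -> u32 {
    if v == 0 {
        return 32;
    }
    let mut c = 0;
    if v & 0xFFFF == 0 { v >>= 16; c += 16; }
    if v & 0x00FF == 0 { v >>= 8;  c += 8;  }
    if v & 0x000F == 0 { v >>= 4;  c += 4;  }
    if v & 0x0003 == 0 { v >>= 2;  c += 2;  }
    if v & 0x0001 == 0 { c += 1; }
    c
}

/// Applied four times to extend it to a `u128`, low word first.
fn ctz128(v: u128) -> u32 {
    for i in 0u32..4 {
        let word = (v >> (32 * i)) as u32;
        if word != 0 {
            return 32 * i + ctz32(word);
        }
    }
    128
}
```
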
Previously, as RODATA this was getting placed in flash, so this lookup
via `interrupt::riscv::vectored::handle_interrupt`:

```rust
let handler = peripherals::__EXTERNAL_INTERRUPTS[interrupt as usize]._handler;
```

was compiling to:

```
40380f1e:       3c006537                lui     a0,0x3c006
40380f22:       63c50c93                addi    s9,a0,1596 # 3c00663c <__EXTERNAL_INTERRUPTS>
...
40381112:       9566                    add     a0,a0,s9
40381114:       410c                    lw      a1,0(a0)
```

which dereferences a pointer into the flash segment `0x3c0...`.

After this change to apply the linker directive from the corresponding
[commit to the pac crate][pac-commit], the symbol is placed in the DRAM
range `0x3fc...` instead:

```
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 48188: 3fc80608   248 OBJECT  GLOBAL DEFAULT    9 __EXTERNAL_INTERRUPTS
```

As seen by `readelf -W -s -C ./esp32c3-hal/target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt | grep -E __EXT\|Ndx`.

[pac-commit]: rustbox/esp-pacs@c8e7cbb
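
The pac-side change amounts to something like the following (a paraphrase of the svd2rust-generated shape, with the array truncated; the real `__EXTERNAL_INTERRUPTS` has an entry per peripheral interrupt):

```rust
pub union Vector {
    pub _handler: unsafe extern "C" fn(),
    pub _reserved: usize,
}

// The linker directive in question: move the table out of flash-mapped
// `.rodata` and into a RAM-backed section (`.trap.rodata`, or `.data`).
#[link_section = ".trap.rodata"]
#[no_mangle]
pub static __EXTERNAL_INTERRUPTS: [Vector; 2] = [
    Vector { _reserved: 0 },
    Vector { _reserved: 0 },
];
```
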
This change is a fair bit less clear-cut; removing `#[ram]` from
`handle_interrupt` does indeed change the resulting binary (because it's
implicitly removing `#[inline(never)]`), but `get_core` is so small a
function that LLVM will always inline it at every optimization level I've
tried, whether it's marked `#[inline]` or not.

So rather than a specific solution to an actual problem, this patch is
more of an example and a warning for what's to come: lacking a precise
mechanism to trace and identify properties of all callees along a
particular path, everything from here on out is highly dependent on
relatively distant effects: in attempting to assert whether "the trap
handling flow is entirely located in SRAM" we necessarily have to
qualify the answer with "given these optimization settings" (and, soon,
much worse).

If correctness here is "not touching flash," then, in the usual
language of Rust, we're about to leave attempts at demonstrating
soundness behind. Instead, we now shift our focus to showing safety
under a particular set of (somewhat fragile) circumstances.
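
(For the record, roughly what `#[ram]` amounts to here; a paraphrase, not the literal macro expansion from esp-hal-procmacros:)

```rust
#[link_section = ".rwtext"] // the RAM-backed text section from the linker scripts
#[inline(never)] // dropping #[ram] implicitly drops this too, re-opening
                 // the function to inlining into flash-resident callers
fn handle_interrupt() {
    // ... body as before ...
}
```
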
As with the prior change, the problem we're solving here is shrinking
the state space: there are many branch points within the optimization
machinery, and these settings align with what I've been looking at most
thoroughly.

NB: I have no idea what effect, if any, the Cargo.toml settings have on
downstream crates. Forcing frame pointers in `.cargo/config.toml`
definitely has no effect.

Removes some flash accesses that occur as the result of an optimization
when mapping a sequence of integers to an action, such as calling
handler `interrupt7` for interrupt code 7.

Previously, this would compile to something similar to:

```asm
40380538:       34202573                csrr    a0,mcause
4038053c:       0506                    slli    a0,a0,0x1
4038053e:       8105                    srli    a0,a0,0x1
        match code {
40380540:       157d                    addi    a0,a0,-1
40380542:       45f9                    li      a1,30
40380544:       2ca5e163                bltu    a1,a0,40380806 <.LBB2_7+0x2ae>
40380548:       050a                    slli    a0,a0,0x2
4038054a:       3c0075b7                lui     a1,0x3c007
4038054e:       1dc58593                addi    a1,a1,476 # 3c0071dc <.LJTI2_0>
40380552:       952e                    add     a0,a0,a1
40380554:       4108                    lw      a0,0(a0)
40380556:       8502                    jr      a0
```

followed by a bunch of entries corresponding to various `interruptN`
calls. `.LJTI2_0` there is stored in flash, and so this requires some
amount of Flash access and/or induces cache pressure.

Unfortunately, LLVM's ELF lowering is fairly determined to emit these
jump tables as part of the [`.rodata` section][elf-lowering-jt-sec], which
makes them difficult to specifically relocate to SRAM on their own to
preserve the optimization. So, this instead disables the optimization
globally (for all crates and all paths), but only for examples built
directly from the HAL (downstream crates would need to apply this setting
themselves, as with force-frame-pointers).

[elf-lowering-jt-sec]: https://llvm.org/doxygen/TargetLoweringObjectFileImpl_8cpp_source.html#l00942

Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific override done here.

In the context of the 'c3, the stock compiler-builtins versions end up
as a fairly large implementation that adds a lot of cache pressure for
minimal wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken for unaligned loads/stores across
the instruction bus; at least in my testing this hasn't been a problem,
because they are simply never invoked in that context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline,
which seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
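
The override itself looks roughly like this (a deliberately simplified sketch: word-sized copies only, none of the commit's µarch tuning, and `.rwtext` assumed as the IRAM-backed section per the HAL's linker scripts):

```rust
// Exporting an unmangled `memcpy` pre-empts the weakly-linked
// compiler-builtins definition; `.rwtext` pins it in IRAM so the hot
// interrupt path never faults out to flash for it.
#[link_section = ".rwtext"]
#[no_mangle]
pub unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    // Word-at-a-time only when both pointers share alignment mod 4.
    if (dest as usize) % 4 == (src as usize) % 4 {
        // Align up with byte copies...
        while i < n && (dest as usize + i) % 4 != 0 {
            *dest.add(i) = *src.add(i);
            i += 1;
        }
        // ...then copy 32 bits per iteration.
        while i + 4 <= n {
            (dest.add(i) as *mut u32).write((src.add(i) as *const u32).read());
            i += 4;
        }
    }
    // Tail (or the whole buffer, if the alignments never matched).
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
    dest
}
```
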
The culmination of this entire patch series: we now place the interrupt
handling code itself into SRAM. The `#[ram]` attribute takes care of the
`GPIO` handler function (and `#[inline(never)]` works fine here because
LLVM can't "see" through the ~indirect reference to this function), but
that then leaves the matter of the function body itself.

Unfortunately, this is necessarily work specific to the details of each
handler's implementation. Elsewhere, I got started down this road
because we needed SPI to start up within a certain time bound, and so
we needed to add attributes to much of the HAL's SPI implementation.
And, in addition to `memset` and `memcpy`, we had to override `memmove`
as well, because something about the data structures used for DMA made
`rustc` emit calls to it in a few places.

As far as this particular example goes, touching `core::fmt` pulls in a
lot of code from fairly far outside our known methods of control:

```
        write(&mut self, args)
40380f00:       3c006537                lui     a0,0x3c006
40380f04:       7a850513                addi    a0,a0,1960 # 3c0067a8 <.Lanon.6ccc2de621f5491d2d69154e15b03ba0.16>
40380f08:       fca42c23                sw      a0,-40(s0)
40380f0c:       4505                    li      a0,1
40380f0e:       fca42e23                sw      a0,-36(s0)
40380f12:       3c006537                lui     a0,0x3c006
40380f16:       6a450513                addi    a0,a0,1700 # 3c0066a4 <anon.f0eba740dd2cb5e3ab050cbaac366ec5.84.llvm.5894916478550538780>
40380f1a:       fea42023                sw      a0,-32(s0)
40380f1e:       fe042223                sw      zero,-28(s0)
40380f22:       fe042423                sw      zero,-24(s0)
40380f26:       3c006537                lui     a0,0x3c006
40380f2a:       5fc50593                addi    a1,a0,1532 # 3c0065fc <anon.7f39784d2cd27331f7ef17a88d595274.0.llvm.10122694886931673864>
40380f2e:       fd440513                addi    a0,s0,-44
40380f32:       fd840613                addi    a2,s0,-40
40380f36:       01c82097                auipc   ra,0x1c82
40380f3a:       92a080e7                jalr    -1750(ra) # 42002860 <core::fmt::write>
```

So, this change circumvents the convenience machinery to call
`write_str` directly, thereby avoiding touching `core::fmt::write` and
its anonymous helper data. Curiously, with the combination of
optimization settings in use, doing so actually erases the `"GPIO
Interrupt\n"` string entirely in favor of an inlined, unrolled loop
emitting each byte in succession:

```asm
40380efe:       40000537                lui     a0,0x40000
40380f02:       06850493                addi    s1,a0,104 # 40000068 <ets_delay_us+0x18>
40380f06:       04700513                li      a0,71
40380f0a:       9482                    jalr    s1
40380f0c:       05000513                li      a0,80
40380f10:       9482                    jalr    s1
...
```

(for those, like me, who have not memorized the decimal->ASCII lookup
table, 71 is 'G' and 80 is 'P')

NB: LTO is _required_ at this stage: turning it off produces calls to
things like `_critical_section_1_0_{acquire,release}` and some `Mutex`
interactions. The latter might be fixable, but I didn't look into it
because the former is seemingly not: the `extern "Rust"` seems to form a
boundary that currently can only be pierced by applying LTO.
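
The circumvention itself is small; a sketch, with the writer type left generic over whatever implements `core::fmt::Write` in the example:

```rust
use core::fmt::Write;

fn report<W: Write>(serial: &mut W) {
    // `write_str` goes straight to the sink; `writeln!` would instead
    // route through `core::fmt::write` and its flash-resident argument
    // metadata, as shown in the disassembly above.
    let _ = serial.write_str("GPIO Interrupt\n");
}
```
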
@bjoernQ (Contributor) commented May 15, 2023

relates to #43

@MabezDev (Member) commented:

@sethp thanks for looking into this! This is really useful experimentation. I won't address the whole discussion at once as there is a lot to digest.

Commits 1 & 3 seem like no-brainers, and I don't think will have any issues being merged. Commit 2 looks okay, but the commit seems to have an undocumented change in that it is no longer subtracting one from the cpu_intr variable. I'm less keen on commit 4, perhaps it's possible to override some intrinsics (maybe they have weak linkage? I added something like this for the mem intrinsics: rust-lang/compiler-builtins#411) instead of having custom implementations inside esp-hal.

@onsdagens has also been looking into improving interrupt performance for the C3 for the RTIC port. We found that we're not actually using CPU vectoring, which could eliminate this lookup & function call completely; have you looked into that at all?

> Phew, that was a whole lot! Let me know if there's a time y'all want to hop on a video call or if there's a semi-synchronous chat I should join to go through this together! I don't expect this to be our last opportunity to connect, though, so no pressure from my end either way.

Sure! We can start via Matrix if that works for you; feel free to DM me or @bjoernQ. If we feel the need to jump on a call later, we certainly can.

@sethp (Contributor, Author) commented May 16, 2023

> @onsdagens has also been looking into improving interrupt performance for the C3 for the RTIC port. We found that we're not actually using CPU vectoring, which could eliminate this lookup & function call completely; have you looked into that at all?

Ohhh yes, this is a very excellent point: a much better strategy than turning off jump tables is to use the existing hardware jump table (at least for the HAL entry bits; IIRC we still had more jump tables to deal with inside the ISR's callees). I have a prototype of this somewhere that I'll try to dust off this week.

> I'm less keen on commit 4, perhaps it's possible to override some intrinsics (maybe they have weak linkage? I added something like this for the mem intrinsics: rust-lang/compiler-builtins#411) instead of having custom implementations inside esp-hal.

Oh, good thought: I'll look into this too. It's a little trickier to override since it's not in compiler-builtins but deep in LLVM.

@bjoernQ (Contributor) commented May 16, 2023

I wonder if it would be possible to place intrinsics code (if not inlined) and data into SRAM via some black magic in the linker scripts?

Also, as a side note: probably all of this should be behind a feature flag (maybe a default feature). The thing is that e.g. in the matter-rs experiment we are already hitting RAM limits (increasing some network buffers by just a few bytes makes it stop working currently) - for sure there is a lot of potential for optimization there, but for things like that, interrupt latency isn't much of a concern while RAM usage is.

@sethp (Contributor, Author) commented May 17, 2023

> I wonder if it would be possible to place intrinsics code (if not inlined) and data into SRAM via some black magic in the linker scripts?

I'm not sure about the code, but it's inlined here anyway: the trouble is the lookup table that code uses. That table is seemingly a totally unnamed symbol, and there doesn't appear to be any linker invocation that's of the shape "referents from this named root" (which would be handy for making basically everything in this PR unnecessary!). The closest I could see from linker-land was "take everything in the .rodata section of the libcore object and put it in this output section", which I didn't really consider because I just assumed that would use too much RAM to be useful.

Speaking of, your addendum is well taken: I agree that putting these allocations behind some kind of conditional would make sense (especially as these chips seem quite capable of executing interrupt handlers entirely from flash). I'm a little hesitant to use a cargo feature for it, since the granularity there is fairly coarse.
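
One possible shape for a finer-grained conditional, though (the feature name here is hypothetical and the import path is assumed; `#[ram]` and `#[interrupt]` are the HAL's existing attributes):

```rust
use esp32c3_hal::prelude::*; // assumed to bring #[ram] and #[interrupt] into scope

// Only pin the handler into IRAM when the (hypothetical) `ram-interrupts`
// feature is enabled; RAM-constrained builds keep it in flash instead.
#[cfg_attr(feature = "ram-interrupts", ram)]
#[interrupt]
fn GPIO() {
    // ... handler body ...
}
```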

I wonder if we could defer the question? It looks like the amount allocated here isn't very much in a relative sense (about 1% each of DRAM and IRAM):

```
     Running `target/debug/mem ../esp-hal/esp32c3-hal/target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt`
DRAM allocation:
addr                    size    (    bytes    )        %        name
0x3fc80000-0x3fc80118   0x00118 (   280.000B  )   0.085%        .data
0x3fc80118-0x3fc81728   0x01610 (     5.516KiB)   1.716%        .rwtext.dummy
0x3fc81728-0x3fc81738   0x00010 (    16.000B  )   0.005%        .bss
0x3fc81738-0x3fc81738   0x00000 (     0.000B  )   0.000%        .uninit
0x3fc81738-0x3fc81738   0x00000 (     0.000B  )   0.000%        .heap
0x3fc81738-0x3fcd0000   0x4e8c8 (   314.195KiB)  97.728%        .stack
                total   0x50000 (   320.000KiB)  99.533%        —
IRAM allocation:
addr                    size    (    bytes    )        %        name
0x40380120-0x40381720   0x01600 (     5.500KiB)   1.378%        .rwtext
                total   0x01600 (     5.500KiB)   1.378%        —
```

vs the baseline from the hal's current main:

```
DRAM allocation:
addr                    size    (    bytes    )        %        name
0x3fc80000-0x3fc80000   0x00000 (     0.000B  )   0.000%        .data
0x3fc80000-0x3fc808fc   0x008fc (     2.246KiB)   0.699%        .rwtext.dummy
0x3fc808fc-0x3fc8090c   0x00010 (    16.000B  )   0.005%        .bss
0x3fc8090c-0x3fc8090c   0x00000 (     0.000B  )   0.000%        .uninit
0x3fc8090c-0x3fc8090c   0x00000 (     0.000B  )   0.000%        .heap
0x3fc8090c-0x3fcd0000   0x4f6f4 (   317.738KiB)  98.830%        .stack
                total   0x50000 (   320.000KiB)  99.533%        —
IRAM allocation:
addr                    size    (    bytes    )        %        name
0x40380008-0x403808f4   0x008ec (     2.230KiB)   0.559%        .rwtext
                total   0x008ec (     2.230KiB)   0.559%        —
```

(these were generated with the very rough, very 'c3-specific tool I've been using for analyzing allocations; I threw it up here if you want to check it out: https://github.com/rustbox/supreme-train/blob/main/src/bin/mem.rs).

@sethp (Contributor, Author) commented May 19, 2023

Happy Friday! I've got a brief update for you on two items here:

Hardware Vectoring

A draft of the hardware vectoring PoC is up, which at this point lives almost entirely in esp-riscv-rt: esp-rs/esp-riscv-rt#11. It works, but I was hoping to spend a little more time carrying the thought through to the esp-hal crate before merging, since there's a whole release thing involved.

I'm fairly excited about how that's working out, though: I think it'll offer a nice way to sidestep some of the thornier issues in this PR, at least for the moment. If we do carry the vectoring all the way "through" instead of fanning back in at handle_interrupts, as that PR still does, I suspect we'll be able to defer the memcpy/memset and maybe even cttz questions for the purposes of the HAL machinery. Downstream code will still have the same shape of problem, but at least has one or two more tools to solve them.

Also, I'm feeling particularly clever about moving a little away from KEEP in #541, since that'll hopefully simplify providing but not mandating the use of the 31 snazzy new entrypoint routines to thread the needle of the space/performance tradeoff that @bjoernQ raised.

svd2rust

I got the ball rolling on integrating the linker directive for __INTERRUPTS all the way uphill. There's also a corresponding change to the esp-pacs repo that I don't expect y'all to merge (crates.io forbids [patch] sections in released crates, right?) but I thought you might like the opportunity to see how it's coming along anyway.

I'm also on matrix now as @sethp:matrix.org, so feel free to hit me up with any questions over there too. Thanks, and have a good weekend!

@MabezDev (Member) commented:

Jumping back into this thread after the initial merges + some travelling on my end to address the other parts you brought up :).

I am on board with adding optimized memcpy routines (b511568); I think we'd see a net benefit in performance across all applications for little cost. Perhaps one experiment you may wish to try would be to use the built-in ROM memcpy routines: the CPU should have one-cycle access to ROM code (though I've asked internally for some clarification on this); I'm also not sure if ROM code is executed in place or is pulled into the cache. If it avoids hitting the cache whilst having the same execution performance as SRAM, that could be a win.
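
Trying that experiment should just be a matter of declaring the ROM export and calling it; a hedged sketch (the symbol name below is an assumption, and the authoritative list of ROM entry points is the chip's ROM linker script, e.g. esp32c3.rom.ld):

```rust
// Hypothetical binding to a ROM-resident copy routine; the ROM linker
// script PROVIDEs each entry point at a fixed address in the 0x4000_xxxx
// range, outside both flash and SRAM.
extern "C" {
    fn memcpy_rom(dest: *mut u8, src: *const u8, n: usize) -> *mut u8;
}

// A thin wrapper keeps call sites tidy while benchmarking the experiment.
#[inline(always)]
unsafe fn copy_via_rom(dest: *mut u8, src: *const u8, n: usize) {
    let _ = memcpy_rom(dest, src, n);
}
```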

> However, it's not actually possible to work around in that way everywhere, because formatting an error message is a huge part of calls to panic!, most of which get emitted automatically for e.g. bounds checks on arrays, unwrap calls, arithmetic overflow (when enabled), etc. Mostly this is 🔥 fine 🔥 because panic!s are generally non-recoverable, so taking extra time to load a string from flash right before the CPU goes into a sleep-until-reset loop isn't really a big deal. That said, it feels to me like a potentially large area of future work, with an uncertain scope and level of risk. I'm very curious what y'all's thoughts are on both how important it is to get the panic handling bits fully in RAM and whether there are any strategies that come to mind for how to do it.

This pains me too, but sadly it seems we need language-level help to overcome this. I've seen a couple of folks poke around the rustc formatting machinery, and subsequently burn out pretty fast. I think the only person with recent success with that was @m-ou-se, though I forget the exact PR they submitted.

You are very well versed in Rust and all its intricacies so I presume you have tried this, but how does panic_immediate_abort affect panics within these RAM-only routines? In theory, on panic it should just call core::intrinsics::abort(), so the compiler should optimize out any formatting machinery.

@MabezDev (Member) commented:

> Perhaps one experiment you may wish to try would be to use the built-in ROM memcpy routines: the CPU should have one-cycle access to ROM code (though I've asked internally for some clarification on this); I'm also not sure if ROM code is executed in place or is pulled into the cache. If it avoids hitting the cache whilst having the same execution performance as SRAM, that could be a win.

I've confirmed that ROM code is executed in place and won't touch the cache.

@MabezDev (Member) commented:

@sethp any thoughts on my comments above?

@jessebraham (Member) commented:

Closing this due to inactivity. Please feel free to re-open this PR, or to open a new one, if you wish to return to this work in the future.
